Introduction

Despite its publication nearly 25 years ago, the controversial book The Bell Curve (TBC)1 has persisted in conversations about IQ. The authors, Richard Herrnstein and Charles Murray, claim that IQ strongly influences many significant life events – like whether someone will enter poverty, drop out of school, get married, or even go to jail. Even more contentious, they extend the consequences of low IQ to demographics like ethnicity and nationality. As evidence, the authors build and discuss dozens of probability models using public data.

    Although TBC has been criticized extensively, these responses have generally come from social psychologists rather than statisticians. In this report, I evaluate the statistical meaning of 24 of the book’s nearly three dozen models via bootstrap optimism on the Matthews correlation coefficient. In doing so, I illustrate the impact of two scientific philosophies, explanation and prediction, on the presentation of statistical models to the public. I will describe why prediction is the only philosophy relevant to the claims in TBC and, finally, why the 24 probability models are inadequate to support the authors’ public policy recommendations.

A Note to Readers

This report is in two pieces: a ~25 page written report and a ~75 page appendix with diagnostic information for the underlying TBC models. Please don’t let the total page count discourage you from reading this report! I tried to write in a style similar to that in TBC (style alone, not its content). If a reader could understand the level of math presented in TBC, I hope she could also read this report (though I admit, maybe with a bit more statistical background).

    I originally wrote this for a data science master’s course. A friend had mentioned TBC after hearing an interview with Charles Murray. I had not heard much about TBC at that time. After going through the book, I found it very odd that no raw data (i.e., observed data points, not just modeled curves) was presented to readers, and that causal models were mixed up with predictive models. Since TBC is politically motivated, I wanted to reproduce as many models as possible and do a more thorough analysis. I want to be clear that I never agreed with the arguments or recommendations in TBC. I don’t endorse the book. In fact, this paper demonstrates how these 24 TBC models are unreliable.
    For non-technical readers, I recommend reading the following sections: Background, Comparison of Optimal MCC Values, and Discussion. For statisticians and data scientists, all results are reproducible with an R project made available on GitHub. See the Methods section for more details. Also, this markdown document uses code folding; to see the underlying code, click the adjacent “CODE” button. From a data science perspective, this project is interesting because it helps illustrate:
  • Reproducible research

  • Bootstrap regression

  • Bias-corrected accelerated confidence intervals

  • Automated model and report building


Background

The Bell Curve and its Claims

TBC stands out not as a stuffy academic paper but as accessible prose intended for the general population. The book is political pop-psychology by design. TBC can be divided into three sections: a summary of prior IQ research, a presentation of new probability models derived by the authors, and a discussion of the findings, including recommendations for public policy. Befitting an accessible book, the probability models are described in simple, interpretable terms. In fact, TBC includes extensive discussion of what multivariate analysis means – from what a “variable” means in statistics to what logistic regression attempts to model. The authors even include an entire appendix titled “Statistics for People Who Are Sure They Can’t Learn Statistics”.

    In this format, it’s clear that the authors want to persuade a general audience and leverage the language of statistics to make their public policy recommendations more credible. Among these recommendations, TBC proposes that the US government should:
  1. Discourage births by poor women:

The technically precise description of America’s fertility policy is that it subsidizes births among poor women, who are also disproportionately at the low end of the intelligence distribution. We urge generally that these policies, represented by the extensive network of cash and services for low-income women who have babies, be ended.

  2. Spend more on gifted students and less on disadvantaged students:

Reallocate some portion of existing elementary and secondary school federal aid away from programs for the disadvantaged to programs for the gifted… At present, there is an overwhelming tilt toward enriching the education of children from the low end of the cognitive ability distribution. We propose more of a balance across the cognitive ability distribution.

  3. Shift immigration policy away from family reunification and toward individual cognitive assessments:

The rules that currently govern immigration provide the other major source of dysgenic pressure. It appears that the mean IQ of immigrants in the 1980s works out to about 95. The low IQ may not be a problem: in the past, immigrants have sometimes shown large increases on such measures. But other evidence indicates that the self-selection process that used to attract the classic American immigrant – brave, hard working, imaginative, self-starting, and often of high IQ – has been changing, and with it the nature of some of the immigrant population.

It should be among the goals of public policy to shift the flow of immigrants away from those admitted under the nepotistic rules (which broadly encourage the reunification of relatives) and toward those admitted under competency rules…

    Each of these claims depends on accurately describing how low IQ impacts various indicators relevant to American society. In their construction and review of new probability models, the authors support their claims using the following indicators (or “binary targets” in the language of statistical modeling):
  • Whether an individual will be under the official poverty line
  • Whether an individual will permanently drop out of high school
  • Whether an individual will receive a GED instead of a high school diploma
  • Whether an individual will receive a bachelor’s degree
  • Whether an individual will be out of the labor force for four weeks or more
  • Whether an individual will be unemployed for four weeks or more
  • Whether an individual will ever marry before the age of 30
  • Whether an individual will be divorced within the first 5 years of marriage
  • Whether an individual will go to jail
  • Whether an individual will have “Middle Class Values” as defined by the authors

Data Used in The Bell Curve

The data used to evaluate the relationship between IQ and each of these binary targets come from the National Longitudinal Survey of Youth 1979 (NLSY79)2, a public dataset made available by the Bureau of Labor Statistics. In this study, 12,686 young Americans (aged 14 – 21 at the survey’s start) of varying backgrounds were interviewed each year from 1979 through 1994. The last relevant year for TBC was 1990, although the survey continued after this time. These interviews covered a vast number of topics, which were tabulated and, more recently, released online.

    Critics of TBC note that the NLSY79 lacks an explicit measure of cognitive ability. In lieu of such a variable, TBC authors used a military achievement test called the Armed Forces Qualification Test (AFQT). In 1980, the US Department of Defense and Military Services selected the NLSY79 participants to take a battery of tests that measure knowledge and skill across “general science, arithmetic reasoning, word knowledge, paragraph comprehension, numerical operations, coding speed, auto and shop information, mathematics knowledge, mechanical comprehension, and electronic information.”3 Nearly all NLSY79 participants completed the battery. From these results, TBC authors normalized the AFQT score using a revised 1989 percentile score and labeled it “zAFQT89”.
    The timing of when the NLSY79 participants took the AFQT (i.e., 1980) is important for this report. In TBC, the binary targets listed above are actually defined for a particular time period. The main takeaway is that the events described by these targets occur in the future relative to when IQ is measured:
  • Under the official poverty line in 1989
  • Permanently dropped out of high school (which would be in the future for some NLSY79 participants at the survey’s start)
  • Received a GED instead of a high school diploma (which would be in the future for some NLSY79 participants at the survey’s start)
  • Received a bachelor’s degree (which would be in the future for some NLSY79 participants)
  • Out of the labor force for four weeks or more in 1989
  • Unemployed for four weeks or more in 1989
  • Ever married before the age of 30 (which would be in the future for all NLSY79 participants at the survey’s start)
  • Divorced within the first 5 years of marriage
  • Interviewed in jail at least once from 1979 to 1990
  • Scored ‘yes’ on the Middle Class Values Index (which depends on information recorded in 1989 and 1990)
    To better isolate the effect of IQ in their probability models, TBC authors include two other variables: socioeconomic status and age. For the latter, the authors use a normalized version of age from NLSY79 participants, which they call “zAge”. For the former, they derive a measure of socioeconomic status, normalize it, and call the result “zSES”. In appendix 2, the authors write: “Since the purpose of the index was to measure the socioeconomic environment in which the NLSY youth was raised, the specific variables employed referred to the parents’ status: total net family income, [total years of] mother’s education, [total years of] father’s education, and an index of occupational status of the adults living with the subject at the age of 14.” This definition is laid out in greater detail across another page of the appendix. However, I omit those details here, since a secondary goal of this report is to reproduce these probability models as-is.
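As a point of reference, the “z” prefix on these variables (zAge, zSES, zAFQT89) conventionally denotes z-scoring: centering a variable at 0 and scaling it to unit standard deviation. The sketch below illustrates this convention with made-up ages; the exact constructions in TBC’s appendix may differ in detail.

```r
# Hypothetical sketch of z-scoring, which the "z" prefix suggests:
# subtract the mean and divide by the standard deviation.
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Illustrative ages (not actual NLSY79 values)
age  <- c(14, 15, 17, 18, 21)
zAge <- z_score(age)
round(zAge, 2)
```

The resulting variable has mean 0 and standard deviation 1, which is what makes statements like “zAFQT89 = -2” (two standard deviations below the mean) meaningful.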

Reproducing the Models

Because of the book’s controversy and age, much has been said about TBC. This criticism has primarily come from social psychologists and other experts in intelligence. From my perspective, their comments seem embedded in psychology. E.g., from James Heckman’s “Lessons from the Bell Curve”4:

[The derived zAFQT89 field] is not the same as the g that can be extracted from test scores available in their data set… They do not emphasize how little of the variation in social outcomes is explained by AFQT or g.

AFQT is an achievement test that can be manipulated by educational interventions… A person’s AFQT score is not an immutable characteristic beyond environmental manipulation.

While there is nothing wrong with these responses, I am surprised how few of them review the probability models described in TBC. In fact, I have only found one detailed, statistical review of a TBC model.

    In 2007, Dr. Claudia Krenz5 posted a standard classification analysis of the probability model on poverty, using a cutoff value of 50% and the training data to evaluate performance. At a high level, a classification analysis means that modeled probabilities are converted into a binary field, which is then analyzed to evaluate the quality and reliability of the underlying probability model. A cutoff value typically defines this conversion: a modeled probability above the cutoff is assumed to be 1. In the context of the poverty model, e.g., NLSY79 participants are assumed to enter poverty when their modeled probabilities are above this cutoff. From Dr. Krenz’s report,

In summary, this web page describes the first logistic regression published in The Bell Curve, in which POVERTY status–either above or below an official level–was predicted by a model consisting of AFQT, SES, and AGE scores from a decade earlier.

I replicated the numbers published in the book’s Appendix 4. I then looked at how well the model fit: it didn’t, not at all in the published sample (N=3367), not much in another independent one (N=1067). I suspect HM didn’t realize they’d made a Type I error, that their model’s predictive accuracy for cases living below the POVERTY level was 0%…

    I found Dr. Krenz’s analysis interesting but thought it could be improved by extending the classification analysis to more TBC models. Namely, this report reviews 24 probability models from TBC. The 10 binary targets are examined on different NLSY79 subsets. E.g., TBC describes two poverty models: one using participants who only graduated high school and another using participants who graduated either high school or college. In general, these data subsets are based on education, gender, employment status, marital status, and age. TBC includes more probability models (e.g., based on IQ for mothers and their children in the NLSY79), but I did not have time to reproduce and review them in this paper.
    Whatever my findings would be, I realized that I would need to explain why classification analysis is relevant to evaluating a probability model. Someone unfamiliar with regression analysis might respond: the classification results say nothing about the probability models because they describe different things. In the following sections, I discuss why:
  1. Classification results are needed to have a fair and human-centered evaluation since TBC’s public policy recommendations have large ramifications for individuals
  2. TBC’s probability models are most appropriately evaluated as predictive classifications in the context of statistical philosophy

Human-Centered Evaluation of TBC’s Probability Models

source('200 Code/000_Misc.R')
source('200 Code/100_Resampling.R')
source('200 Code/200_Classification.R')
source('200 Code/300_Visualizations.R')
source('200 Code/400_Model_Definitions.R')

save.folder <- '100 Data'
code.folder <- '200 Code'

model.labels <- tbc.models$label

# Get filenames for needed data components
filenames.output <- file.path(save.folder, paste0(model.labels, '_outputs.rds'))
filenames.input <- file.path(save.folder, paste0(model.labels, '_inputs.rds'))

filenames.mcc.bootCI.training <- file.path(save.folder, paste0(model.labels, '_DT.mcc.bootCI.training.rds'))
filenames.mcc.bootCI.holdout <- file.path(save.folder, paste0(model.labels, '_DT.mcc.bootCI.holdout.rds'))
names(filenames.mcc.bootCI.training) <- model.labels
names(filenames.mcc.bootCI.holdout) <- model.labels

filenames.probability.bootCI.training <- file.path(save.folder, paste0(model.labels, '_DTs.probability.bootCI.training.rds'))
filenames.probability.bootCI.holdout <- file.path(save.folder, paste0(model.labels, '_DTs.probability.bootCI.holdout.rds'))

filenames.classification.bootCI.training <- file.path(save.folder, paste0(model.labels, '_DTs.mcc.classification.bootCI.training.rds'))
filenames.classification.bootCI.holdout <- file.path(save.folder, paste0(model.labels, '_DTs.mcc.classification.bootCI.holdout.rds'))

# Import needed data components
outputs <- rbindlist(
  lapply(filenames.output, readRDS)
  , fill = TRUE
  , use.names = TRUE
)

inputs <- lapply(filenames.input, readRDS)
names(inputs) <- model.labels

DT.mcc.bootCI.training <- rbindlist(lapply(
    seq_along(filenames.mcc.bootCI.training), function(x) 
    readRDS(filenames.mcc.bootCI.training[x])[, label := names(filenames.mcc.bootCI.training[x])]
  )
  , fill = TRUE
  , use.names = TRUE
)
DT.mcc.bootCI.holdout <- rbindlist(lapply(
    seq_along(filenames.mcc.bootCI.holdout)
    , function(x) readRDS(filenames.mcc.bootCI.holdout[x])[, label := names(filenames.mcc.bootCI.holdout[x])]
  )
  , fill = TRUE
  , use.names = TRUE
)

DTs.probability.bootCI.training <- lapply(filenames.probability.bootCI.training, readRDS)
DTs.probability.bootCI.holdout <- lapply(filenames.probability.bootCI.holdout, readRDS)
names(DTs.probability.bootCI.training) <- model.labels
names(DTs.probability.bootCI.holdout) <- model.labels

DTs.mcc.classification.bootCI.training <- lapply(filenames.classification.bootCI.training, readRDS)
DTs.mcc.classification.bootCI.holdout <- lapply(filenames.classification.bootCI.holdout, readRDS)
names(DTs.mcc.classification.bootCI.training) <- model.labels
names(DTs.mcc.classification.bootCI.holdout) <- model.labels

# Create category field
outputs[, category := factor(category, levels = unique(tbc.models$category))]

# Organize bootstrapped coefficients
DT.coef <- rbindlist(
  lapply(outputs$label, 
         function(label.var) {
           DT.label <- as.data.table(outputs[label == label.var])
           
           DT.coef.label <-  as.data.table(
             coef(summary(DT.label[, glm.model][[1]]))
             , keep.rownames = TRUE
           )
           
           DT.coef.label[, label := DT.label$label]

          return(DT.coef.label)
         }
  )
)

DT.coef.pValue <- dcast(DT.coef, formula = label ~ rn, value.var = 'Pr(>|z|)')

# Calculate the mean modeled response from the poverty model (i.e., Pov89_modeled)
Pov89_modeled.2stdev <- inputs[['poverty']][['DT.model.training']][
    , .(
        prob.model = mean(Pov89_modeled, na.rm = TRUE)
    )
    , by = .(zAFQT89_banded = floor(zAFQT89 * 10) / 10)
][
    , .(
        zAFQT89_banded
        , zero = abs(zAFQT89_banded - (-2))
        , prob.model
    )
][order(zero)][1L, prob.model]

Pov89_modeled.2stdev.formatted <- paste0(floor(Pov89_modeled.2stdev * 1e4) / 1e4 * 1e2, '%')

confusionMatrix.poverty <- inputs[['poverty']][['DT.model.training']][
  , .(
    `Did not enter poverty (classification)` = sum(as.integer(Pov89_modeled <= Pov89_modeled.2stdev))
    , `Enter poverty (classification)` = sum(as.integer(Pov89_modeled > Pov89_modeled.2stdev))
  )
  , by = .(Pov89 = ifelse(Pov89 == 0, 'Did not enter poverty (true)', 'Enter poverty (true)'))
]

accuracy.poverty.formatted <- paste0(
  floor(
    as.numeric((confusionMatrix.poverty[1, 2] + confusionMatrix.poverty[2, 3]) / nrow(inputs[['poverty']][['DT.model.training']])) * 1e4
  ) / 1e2
  , '%'
)

probabilityEnterPoverty <- outputs[label == 'poverty', response.mean.training]

probabilityEnterPoverty.formatted <- paste0(floor(probabilityEnterPoverty * 1e4)/1e4 * 1e2, '%')

mcc.poverty <- calculateMCC(
    TN = confusionMatrix.poverty[1, 2] / nrow(inputs[['poverty']][['DT.model.training']])
    , FP = confusionMatrix.poverty[1, 3] / nrow(inputs[['poverty']][['DT.model.training']])
    , FN = confusionMatrix.poverty[2, 2] / nrow(inputs[['poverty']][['DT.model.training']])
    , TP = confusionMatrix.poverty[2, 3] / nrow(inputs[['poverty']][['DT.model.training']])
)

mcc.poverty.formatted <- paste0(floor(mcc.poverty * 1e4) / 1e4 * 1e2, '%')

TNR.poverty <- as.numeric(confusionMatrix.poverty[1, 2] / (confusionMatrix.poverty[1, 2] + confusionMatrix.poverty[1, 3]))
TPR.poverty <- as.numeric(confusionMatrix.poverty[2, 3] / (confusionMatrix.poverty[2, 3] + confusionMatrix.poverty[2, 2]))

Extraordinary claims require extraordinary evidence. When a recommendation for public policy can negatively impact a group of people, material evidence is required to support the claim. In TBC, the authors cite their probability models as the reason to accept their conclusions. To assess whether these models are meaningful and reliable, they should be evaluated from the perspectives of those who could be harmed by the underlying claims.

    This evaluation first requires that the potentially harmed population be well-defined. Revisiting TBC’s claims, the authors say (with my added emphasis):

The technically precise description of America’s fertility policy is that it subsidizes births among poor women, who are also disproportionately at the low end of the intelligence distribution.

By saying “disproportionately”, the authors imply that there is some IQ value below which the negative consequences of their probability models become most severe and worthy of political action. More specifically, if the authors are to objectively identify individuals they believe have too low an IQ, they would need to state such an IQ value. And if their probability models are to be taken seriously, then the authors should use them to clarify what “disproportionate” means. Below, I play devil’s advocate for TBC authors by finding such a disproportionate IQ.

    Focusing on a binary target with a negative outcome like poverty, a “cutoff IQ” value could – hypothetically – be defined as the IQ where the average of modeled probabilities across a dataset exceeds, e.g., 31.99% (the modeled probability of entering poverty at an IQ two standard deviations below the mean, i.e., where zAFQT89 = -2 in the training data). Said more simply, individuals would be flagged as having too low an IQ when the probability model suggests they are at least as likely to enter poverty as someone with an “abnormally low” IQ, as defined by the training data.
    The figure below shows the training data used to build the poverty model (in data points) and the modeled probability built from that data (as a fitted curve). The red vertical line shows the “cutoff IQ” and the red horizontal line shows the associated modeled probability (i.e., 31.99%) that could be used to define a population impacted by the authors’ recommendations. (Note: This figure will be discussed in more detail later on. For now, it is displayed to motivate “cutoff IQs and probabilities”.)
plot.training.actual <- ggplot(
  inputs[['poverty']][['DT.model.training']][
    , .(
      response = mean(Pov89)
      , modeled = mean(Pov89_modeled)
    )
    , by = .(
      zAFQT89_banded = floor(zAFQT89 * 10)/10
    )
  ]
  , aes(
    x = zAFQT89_banded
    , y = response
  )
) + 
  geom_point() + 
  theme_bw() + 
  ylab(paste('%', tbc.models[label == 'poverty', target.description]))

ggplotly(
  plot.training.actual + 
    geom_line(aes(y = modeled), color = lightBlue) + 
    geom_segment(
      x = -3
      , xend = -2
      , y = Pov89_modeled.2stdev
      , yend = Pov89_modeled.2stdev
      , linetype = 2
      , color = 'red'
    ) + 
    geom_segment(
      x = -2
      , xend = -2
      , y = 0
      , yend = Pov89_modeled.2stdev
      , linetype = 2
      , color = 'red'
    )
)

Plot of observed vs modeled probabilities across zAFQT89

    It’s also important to realize that all the binary targets in TBC tend to have a monotonic relationship with IQ, e.g., individuals with lower IQ tend to have greater observed frequencies of entering poverty in 1989. Combining this knowledge with the description of a “cutoff IQ” above, each TBC model could identify individuals impacted by the authors’ recommendations using a cutoff value for either IQ OR modeled probability, i.e., the two are linked. From the figure above, the “cutoff IQ” of -2 is linked with the “cutoff probability” of 31.99%.
    Recalling Dr. Krenz’s analysis, a cutoff value used in classification acts in the same way as a modeled probability linked to a “cutoff IQ”. Hence, a classification analysis of TBC’s probability models is a human-centered evaluation that is necessary to balance the authors’ controversial claims.
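The cutoff-based conversion from modeled probabilities to binary classifications can be sketched in a few lines of R. The probabilities below are illustrative values, not taken from the TBC models:

```r
# Convert modeled probabilities into binary classifications with a cutoff:
# a probability above the cutoff is classified as the event occurring (1).
modeled.prob <- c(0.10, 0.28, 0.35, 0.48, 0.62)
cutoff <- 0.3199  # e.g., the "cutoff probability" linked to zAFQT89 = -2
classification <- as.integer(modeled.prob > cutoff)
classification
# 0 0 1 1 1
```

The resulting binary field is what a classification analysis then compares against the true outcomes.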

Statistical Philosophy in The Bell Curve

While the prior section may sound trivial to a statistician or data scientist, my point is that fairness recommends a classification analysis. In this section, I go further by claiming that TBC’s probability models should only be evaluated as predictive classifications based on the statistical philosophy best describing TBC’s models.

    Dr. Galit Shmueli has written many publications about two common statistical philosophies: explanation (or sometimes called causality) and prediction6. Dr. Shmueli’s main premise is (with my added emphasis):

[Explanation and prediction] are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling process and on its consequences. Although not explicitly stated in the statistics methodology literature, applied statisticians instinctively sense that predicting and explaining are different.

At a high-level, explanatory models:

  • Describe a direct, causal mechanism among the model variables and the model target
  • Are retrospective by focusing on historical data
  • Are motivated by extensive theory
  • Prioritize the average case of the training data
  • Are evaluated more at the individual parameter level and often with the training data alone

Whereas predictive models:

  • Describe indirect associations among the model variables and the model target
  • Are prospective by focusing on future data
  • Are motivated more by observed data than by theory
  • Prioritize the individual cases of the underlying data
  • Are evaluated more at the model level and often on one or more holdout datasets

How do these criteria play out in TBC? Quotes from the authors reflect their thoughts on explanatory modeling in TBC:

  1. The probability models are generally assumed to be explanatory in nature:

Part II was circumscribed, taking on social behaviors one at a time, focusing on causal roles

If after looking at a variety of these other things which both theory and common sense say should have some bearing on whether a person ends up in poverty, but one ends up with a large, statistically independent role for I.Q., it seems to me to make a causal statement that I.Q. looks like its a cause of poverty, it is a reasonable thing to do.7

By the end of the chapter [on poverty], we will have drawn a controversial conclusion. How did we get there? What makes us think that we have got our causal ordering right? We will walk through the analyses that lie behind our conclusions

  2. The authors focus on individual parameters, namely IQ, instead of the overall model (R-squared in the quote below):

Even an inherently strong relationship can result in low values of R2 if the data points are bunched in various ways, and relatively noisy relationships can result in high values if the sample includes disproportionate numbers of outliers… We therefore consider the regression coefficients themselves (and their associated p values) to suit our analytic purposes better than R2, and that is why those are the ones we relied on in the text.

Note: The authors do have a point that R-squared is not the most meaningful statistic for every model. However, they did not adequately motivate why probability models with rare events, as in the poverty model with an overall frequency of 7.24%, should ignore diagnostics at the model level.

Below, I comment on the role of predictive modeling in TBC:

  1. The binary targets are not materially causal because any proposed mechanism from IQ, SES, and Age is not direct enough. Focusing again on the poverty model, many explanations are possible as to why an individual would enter poverty. Saying IQ is causal of poverty would require linking IQ with the complex process that ends with the NLSY79 participant below the poverty line. Instead, such explanations fall squarely in the realm of associations, because considerable chance (i.e., noise) is involved.

  2. As mentioned in the section introducing data used in TBC, most binary targets are defined to be measured nearly a decade in the future (relative to when IQ was measured). Undoubtedly, modeling such targets is a prospective process.

  3. The probability models in TBC should prioritize the individual. In the prior section, I explained why fairness recommends the individual-focused classification analysis. Also, TBC’s authors frequently comment on the importance of acknowledging individuals and not to be judgmental: “We cannot think of a legitimate argument why any encounter between individual whites and blacks need be affected by the knowledge that an aggregate ethnic difference in measured intelligence is genetic instead of environmental.”

Overall, I claim that TBC’s probability models more closely follow predictive philosophy than explanatory, and so they should be evaluated as such.

Why Use the Matthews Correlation Coefficient?

Many of the binary targets are uncommon-to-rare in occurrence. In logistic regression, this is called imbalanced data and can make some statistics unreliable. To counter this imbalance, I estimate the Matthews correlation coefficient (MCC), which is the correlation between the binary target and the binary classification derived from a TBC probability model. MCC can be interpreted in the same way as an ordinary (Pearson) correlation. The following guideline is a reasonable starting point for judging MCC values8:

Size of Correlation               Interpretation
0.90 to 1.00 (−0.90 to −1.00)     Very high positive (negative) correlation
0.70 to 0.90 (−0.70 to −0.90)     High positive (negative) correlation
0.50 to 0.70 (−0.50 to −0.70)     Moderate positive (negative) correlation
0.30 to 0.50 (−0.30 to −0.50)     Low positive (negative) correlation
0.00 to 0.30 (0.00 to −0.30)      Negligible correlation

MCC is robust to imbalanced data because it accounts for all possible classification scenarios9 (i.e., in the equation below, each scenario is represented in the numerator):

  • True Positive (TP): number of observations where classification expects the target event to occur and it did according to the binary target
  • False Positive (FP): number of observations where classification expects the target event to occur but it did not according to the binary target
  • True Negative (TN): number of observations where classification expects the target event not to occur and it did not according to the binary target
  • False Negative (FN): number of observations where classification expects the target event not to occur but it did according to the binary target

\[{\displaystyle {\text{MCC}}={\frac {{\mathit {TP}}\times {\mathit {TN}}-{\mathit {FP}}\times {\mathit {FN}}}{\sqrt {({\mathit {TP}}+{\mathit {FP}})({\mathit {TP}}+{\mathit {FN}})({\mathit {TN}}+{\mathit {FP}})({\mathit {TN}}+{\mathit {FN}})}}}}\]
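As a standalone illustration of the formula above (the report’s own calculateMCC helper, loaded from the 200 Code folder, may differ in details), a minimal R version is:

```r
# Matthews correlation coefficient from the four classification counts.
# Returns NA when any marginal sum is zero (the denominator vanishes).
mcc <- function(TP, FP, TN, FN) {
  denom <- sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  if (denom == 0) return(NA_real_)
  (TP * TN - FP * FN) / denom
}

# Counts from the poverty confusion matrix shown later in this report:
mcc(TP = 18, FP = 30, TN = 3093, FN = 226)
# approximately 0.1403
```

Note that MCC is unchanged if all four counts are divided by the sample size, so computing it from rates (as the report’s code does) gives the same value.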

    To better understand imbalanced data, the figure below compares the modeled probabilities for NLSY79 participants to enter poverty with their true outcomes in 1989. Each data point represents a NLSY79 participant included in the poverty model (with its exact x-axis location “jittered” in order to see data volumes more easily). The red horizontal line represents the -2 standard deviation cutoff probability of 31.99% described above. The four rectangular regions represent each of the four classification scenarios. Below the figure is a table showing the total number of participants in each classification scenario.
ggplot() +
  geom_jitter(
    data = inputs[['poverty']][['DT.model.training']]
    , aes(
      x = factor(ifelse(Pov89 == 0, 'No', 'Yes'))
      , y = Pov89_modeled
    )
  ) +
  geom_hline(
    yintercept = Pov89_modeled.2stdev
    , linetype = 2
    , color = 'red'
  ) +
  geom_rect(
    data = data.table(
      xstart = rep(c(0.5, 1.5), 2)
      , xend = rep(c(1.5, 2.5), 2)
      , ystart = c(0, 0, Pov89_modeled.2stdev, Pov89_modeled.2stdev)
      , yend = c(Pov89_modeled.2stdev, Pov89_modeled.2stdev, 1, 1)
      , color = c('TN', 'FN', 'FP', 'TP')
    )
    , aes(
      xmin = xstart
      , xmax = xend
      , ymin = ystart
      , ymax = yend
      , fill = factor(color)
    )
    , alpha = 0.5
  ) +
  labs(fill = 'Classification Scenarios:') +
  xlab('Under the official poverty line in 1989') +
  ylab('Modeled Probability') +
  theme_bw() +
  theme(
    axis.text=element_text(size = 16)
    , axis.title=element_text(size = 16)
    , legend.text=element_text(size = 16)
    , legend.title=element_text(size = 16)
    , legend.position="bottom"
  )
Plot comparing different classification scenarios

kable(
  confusionMatrix.poverty[
    , .(
      `Poverty Model Target` = Pov89
      , `Did not enter poverty (classification)`
      , `Enter poverty (classification)`
    )
  ]
  , caption = 'Table comparing different classification scenarios (i.e., confusion matrix)'
  , align = rep('c', 3)
) %>% kable_styling(bootstrap_options = "striped", full_width = F, position = 'center')
Table comparing different classification scenarios (i.e., confusion matrix)

| Poverty Model Target         | Did not enter poverty (classification) | Enter poverty (classification) |
|------------------------------|:--------------------------------------:|:------------------------------:|
| Did not enter poverty (true) | 3093                                   | 30                             |
| Enter poverty (true)         | 226                                    | 18                             |
    A common classification performance metric is “accuracy”: here, the percent of NLSY79 participants correctly classified as entering poverty or not in 1989. At the cutoff probability of 31.99%, the accuracy is 92.39%. While this seems very high, the figure above reveals its weakness: very few of the NLSY79 participants who truly entered poverty were classified to do so (see the sparsely populated purple ‘TP’ region and the table above). Instead, most of the correct classifications came from participants who did not enter poverty in 1989 (the blue ‘TN’ region). The accuracy metric thus mostly reflects that the binary target is rare, rather than describing the performance of the underlying model. In comparison, the MCC value at the same cutoff probability is 14.03%, which better agrees with the poorly populated ‘TP’ region in purple.
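Both quoted values can be checked directly from the confusion matrix above:

```r
# Counts from the confusion matrix above (cutoff probability = 31.99%)
TP <- 18; TN <- 3093; FP <- 30; FN <- 226

accuracy <- (TP + TN) / (TP + TN + FP + FN)
mcc <- ((TP * TN) - (FP * FN)) /
  sqrt(as.numeric(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

accuracy  # 0.92397..., i.e., 92.39% after truncation
mcc       # 0.14033..., i.e., 14.03%
```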
    Lastly, we can see the impact of rare events on these performance metrics through simulation. In particular, we can hold the true positive and true negative classification rates (TPR and TNR, respectively) constant while varying the rarity of the binary target, and then calculate accuracy and MCC. The figure below shows the simulated accuracy and MCC values at the TPR and TNR observed in the poverty model. The takeaway is that MCC accounts for rare binary targets whereas accuracy does not, making MCC the more reliable metric for these models.
simulateConfusionMatrix.constant.TNR.TPR <- function(TNR, TPR, pctPositive) {
  TP <- pctPositive * TPR
  TN <- (1 - pctPositive) * TNR
  
  output <- list(
    TN = TN
    , FN = pctPositive * (1 - TPR)
    , FP = (1 - pctPositive) * (1 - TNR)
    , TP = TP
  )
  
  return(output)
}

pctPositive.range <- seq(from = 0, to = 1, length.out = 101)
TPR.range <- seq(from = 0, to = 1, length.out = 101)
TNR.range <- seq(from = 0, to = 1, length.out = 101)
DT.demoMCC <- as.data.table(
  expand.grid(
    pctPositive.range
    , TPR.range
    , TNR.range
  )
)
setnames(DT.demoMCC, paste0('Var',1:3), c('pctPositive', 'TPR', 'TNR'))
DT.demoMCC[, c('TN', 'FN', 'FP', 'TP') := as.list(simulateConfusionMatrix.constant.TNR.TPR(TNR, TPR, pctPositive))]
DT.demoMCC[, mcc := calculateMCC(TN, FN, FP, TP)]
DT.demoMCC[, accuracy := (TN + TP) / (TN + FN + FP + TP)]
DT.demoMCC[, TPR_TNR := factor(paste0(TPR,'_',TNR))]

ggplot(
  DT.demoMCC[TPR == floor(TPR.poverty * 1e2) / 1e2 & TNR == floor(TNR.poverty * 1e2) / 1e2]
  , aes(
    x = pctPositive
    , y = mcc
    , color = 'mcc'
  )
) + 
  geom_line() + 
  geom_line(
    aes(
      y = accuracy
      , color = 'accuracy'
    )
  ) +
  geom_vline(aes(
      xintercept = probabilityEnterPoverty
      , color = 'poverty'
    )
      , linetype = 2
      , size = 1.25
      , show.legend = FALSE
    ) +
  scale_colour_manual(
    name = 'Classification Metrics:\n(TPR = 0.07, TNR = 0.99)'
    , values = c(
      'mcc' = 'darkred'
      ,'accuracy' = 'blue'
      ,'poverty' = 'black'
    )
    , labels = c('Accuracy', "Matthew's Correlation Coefficient", 'Poverty Model (Training Data)')
  ) +
  guides(
      color = guide_legend(
        nrow = 2
        , byrow = TRUE
        , override.aes = list(
            linetype = c("solid", "solid", "dotted")
            , size = c(1, 1, 1.25)
        )
      )
    ) +
  ylab('Performance Metric Percent') +
  xlab('Percent of True Observations\n(i.e., small value means rare event, large value means common event)') +
  theme_bw() +
  theme(
    axis.text=element_text(size = 16)
    , axis.title=element_text(size = 16)
    , legend.text=element_text(size = 16)
    , legend.title=element_text(size = 16)
    , legend.position="bottom"
  )
Plot comparing accuracy and Matthews Correlation Coefficient as classification metrics for imbalanced data

Why Use Bootstrapping Statistics?

The most important statistical tool for this analysis is called bootstrapping10, which I use to:

  • Estimate the range of MCC values expected to contain the “true” value

  • Account for any abnormalities in the original data that may bias its MCC values

    At a high level, bootstrapping creates a new data sample by randomly drawing observations from the original data with replacement, keeping the same total number of observations (so the same observation can be picked multiple times). E.g., if a data set has five observations (1, 2, 3, 4, 5), one random bootstrapped sample could be (2, 2, 1, 3, 5); another could be (4, 3, 1, 5, 2). Both samples have the same total observations (i.e., 5), and the first contains a duplicate.
    Bootstrapping can more informally be viewed as randomly assigning (integer) weights to each original observation. Some bootstrapped observations get more weight than others, which would be reflected in a probability model using this data, while others receive a weight of zero (i.e., are not included at all).
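This resampling is one line of base R; tabulating the result shows the implied integer weights (the seed and the five-point example are mine, for illustration):

```r
set.seed(123)
original <- c(1, 2, 3, 4, 5)

# Draw a bootstrap sample: same size, sampled with replacement
boot.sample <- sample(original, size = length(original), replace = TRUE)
boot.sample

# Equivalent view as integer weights per original observation
# (a weight of 0 means the observation was left out of this sample)
table(factor(boot.sample, levels = original))
```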

Bootstrapped Distribution

How does bootstrapping give us a range of MCC values? For each randomly generated bootstrapped sample, we can build a new probability model and use it to calculate an MCC value.

    In the following plot, two samples are bootstrapped from the poverty training data. Each sample is used to fit a new probability model shown below as dashed lines (with the solid black line representing the modeled probability using the original data). You can see that the bootstrapped probabilities are very close to the original modeled probabilities – but not exactly the same.
model <- 'poverty'

set.seed(123)
DT.boot.demo <- logit.bootstrap(
  DT = inputs[[model]][['DT.model.training']]
  , iterations = 2
  , formula = inputs[[model]][['model parameters']][, formula][[1]]
  , DT.holdout = inputs[[model]][['DT.model.holdout']]
  , key = 'ID'
  , parallel = FALSE
)

appendBootDemoResults <- function(DT.model, DT.boot, label, iteration) {
  DT.boot <- copy(DT.boot)[, Pov89_modeled.scored := modeled_probability]
  
  output <- DT.model[
    DT.boot[
      bootIteration == iteration
      , mget(c('ID','Pov89_modeled.scored'))
    ]
    , on = 'ID'
  ][, scoredData := label]
  
  return(output)
}

DT.plot.boot.demo <- rbindlist(list(
  appendBootDemoResults(
    DT.model = inputs[[model]][['DT.model.training']]
    , DT.boot = DT.boot.demo[['modeled training']]
    , label = 'Scored Original Data 1'
    , iteration = 1
  )
  , appendBootDemoResults(
    DT.model = inputs[[model]][['DT.model.training']]
    , DT.boot = DT.boot.demo[['modeled training']]
    , label = 'Scored Original Data 2'
    , iteration = 2
  )
  , appendBootDemoResults(
    DT.model = inputs[[model]][['DT.model.training']]
    , DT.boot = DT.boot.demo[['modeled bootstrap training']]
    , label = 'Bootstrapped Sample 1'
    , iteration = 1
  )
  , appendBootDemoResults(
    DT.model = inputs[[model]][['DT.model.training']]
    , DT.boot = DT.boot.demo[['modeled bootstrap training']]
    , label = 'Bootstrapped Sample 2'
    , iteration = 2
  )
))

(plot.training.actual + 
  geom_line(aes(y = modeled)) + 
  geom_line(
    data = DT.plot.boot.demo[
      scoredData %like% 'Boot'
      , .(
        counts = .N
        , modeled = mean(Pov89_modeled.scored)
      )
      , by = .(
        scoredData
        , zAFQT89_banded = floor(zAFQT89 * 10)/10
      )
      ]
    , aes(
      x = zAFQT89_banded
      , y = modeled
      , color = factor(scoredData)
    )
    , linetype = 2
  ) +
  guides(color = 'none')) %>% 
  ggplotly() %>%
  layout(
    legend = list(
      orientation = "h"
      , xanchor = "center"
      , x = 0.5
      , y = 0
    )
  )

Plot showing example of bootstrapped modeled probabilities

    These small variations in modeled probabilities let us generate a distribution of MCC values. For each set of bootstrapped probabilities, we can specify a cutoff probability (to create classifications) and then calculate an MCC value. The following table shows the MCC values for each dataset used in the preceding plot. By repeating this process thousands of times, we can be more confident about the plausible MCC values for a particular cutoff probability. Lastly, we can extend this by scanning a large range of cutoff probabilities to find the best possible MCC value for a given TBC model.
mcc.boot.demo <- c(
  calculateMCC.boot(
    DT.probabilities = inputs[[model]][['DT.model.training']]
    , probabilityString = 'Pov89_modeled'
    , actualString = 'Pov89'
    , bootLabelString = ''
    , numCutoffs = 1
    , singleCutoff = 0.15
  )[, mcc]
  , calculateMCC.boot(
    DT.probabilities = DT.boot.demo[['modeled bootstrap training']][bootIteration == 1]
    , probabilityString = 'modeled_probability'
    , actualString = 'actual'
    , bootLabelString = ''
    , numCutoffs = 1
    , singleCutoff = 0.15
  )[, mcc]
  , calculateMCC.boot(
    DT.probabilities = DT.boot.demo[['modeled bootstrap training']][bootIteration == 2]
    , probabilityString = 'modeled_probability'
    , actualString = 'actual'
    , bootLabelString = ''
    , numCutoffs = 1
    , singleCutoff = 0.15
  )[,mcc]
)

DT.mcc.boot.demo <- data.table(
  Data = c(
    'Original Poverty Training'
    , 'Bootstrapped Sample 1'
    , 'Bootstrapped Sample 2'
  )
  , MCC = mcc.boot.demo
)

kable(
  DT.mcc.boot.demo
  , caption = 'Bootstrapped MCC Values\n(Cutoff = 0.15)'
  , align = c('l', 'c')
) %>%
  kable_styling(
    bootstrap_options = "striped"
    , full_width = F
  )
Bootstrapped MCC Values (Cutoff = 0.15)

| Data                      |    MCC    |
|---------------------------|:---------:|
| Original Poverty Training | 0.2228674 |
| Bootstrapped Sample 1     | 0.2738615 |
| Bootstrapped Sample 2     | 0.2094425 |

Bootstrapped Optimism

How can bootstrapping help account for abnormalities in the original data? By abnormality, I mean a set of observations in the original data that poorly (or overly favorably) reflects the larger population. These abnormalities can bias certain statistics (e.g., MCC values) and lead to unreasonable conclusions. Bootstrapping can reduce their impact by comparing statistics from:

  • Bootstrapped models applied to the same bootstrapped sample

  • Bootstrapped models applied to the original data

  • Models built from and applied to the original data

    An MCC value calculated from the original data may be biased due to abnormalities. In the context of TBC, e.g., there could be too many NLSY79 participants with extreme IQ values (low or high). We can account for this possibility by removing the “biased portion” of that MCC value. This bias is called “optimism”11, since it often overstates a model’s performance.
    As an explicit example, the following plot shows the modeled probabilities (as dashed lines) calculated by the bootstrapped models from the prior section, but applied to the original data. Recall that a bootstrapped model uses different weights for each observation relative to the original data, so the modeled probabilities in this plot give a slightly different perspective than those built from the original data. For instance, all three sets of modeled probabilities share the same spike near zAFQT89 = -2, yet the lines rarely overlap in that area.
(plot.training.actual + 
  geom_line(aes(y = modeled)) + 
  geom_line(
    data = DT.plot.boot.demo[
      scoredData %like% 'Score'
      , .(
        counts = .N
        , modeled = mean(Pov89_modeled.scored)
      )
      , by = .(
        scoredData
        , zAFQT89_banded = floor(zAFQT89 * 10)/10
      )
      ]
    , aes(
      x = zAFQT89_banded
      , y = modeled
      , color = factor(scoredData)
    )
    , linetype = 2
  ) +
  guides(color = 'none')) %>%
  ggplotly() %>%
  layout(
    legend = list(
      orientation = "h"
      , xanchor = "center"
      , x = 0.5
      , y = 0
    )
  )

Plot showing example of modeled probabilities from bootstrapped model applied to the original data

mcc.optimism.demo <- c(
  calculateMCC.boot(
    DT.probabilities = DT.boot.demo[['modeled training']][bootIteration == 1]
    , probabilityString = 'modeled_probability'
    , actualString = 'actual'
    , bootLabelString = ''
    , numCutoffs = 1
    , singleCutoff = 0.15
  )[, mcc]
  , calculateMCC.boot(
    DT.probabilities = DT.boot.demo[['modeled training']][bootIteration == 2]
    , probabilityString = 'modeled_probability'
    , actualString = 'actual'
    , bootLabelString = ''
    , numCutoffs = 1
    , singleCutoff = 0.15
  )[,mcc]
)

DT.mcc.optimism.demo <- data.table(
  Data = c(
    'Bootstrapped Sample 1'
    , 'Bootstrapped Sample 2'
  )
  , `Applied to Bootstrapped Model` = mcc.boot.demo[-1]
  , `Applied to Original Model` = mcc.optimism.demo
  , `Difference in MCC Values` = mcc.boot.demo[-1] - mcc.optimism.demo
)

optimism.demo <- mean(mcc.boot.demo[-1] - mcc.optimism.demo)
    The table below shows the MCC values calculated when applying the bootstrapped models from the prior section to both the bootstrapped samples and the original data. The last column gives the difference in MCC values for each row. The bootstrapped optimism for this example is the average of those differences (i.e., the average of the 4th column below): 1.84%.
kable(
  DT.mcc.optimism.demo
  , caption = 'Bootstrapped Optimism for MCC (Cutoff = 0.15)'
  , align = c('l', rep('c', 3))
) %>%
  kable_styling(
    bootstrap_options = "striped"
    , full_width = F
  ) %>%
  add_header_above(c(' ' = 1, 'MCC' = 2, ' ' = 1)) %>%
  row_spec(2, hline_after = TRUE)
Bootstrapped Optimism for MCC (Cutoff = 0.15)

| Data                  | MCC (Applied to Bootstrapped Model) | MCC (Applied to Original Model) | Difference in MCC Values |
|-----------------------|:-----------------------------------:|:-------------------------------:|:------------------------:|
| Bootstrapped Sample 1 | 0.2738615                           | 0.2184252                       | 0.0554363                |
| Bootstrapped Sample 2 | 0.2094425                           | 0.2279203                       | -0.0184778               |
    Returning to the original goal of this section: we remove the biased portion of the MCC value calculated from the original data set (referred to as the “apparent value” to distinguish it from the different types of bootstrapped values) by subtracting the optimism estimate. The result is the “optimism-corrected MCC value,” which is used later in this report. As noted in the prior section, this optimism routine would be repeated thousands of times in order to obtain a more reliable MCC estimate.

\[ \begin{aligned} \text{Optimism-corrected MCC Value} &= \text{Apparent MCC Value} - \text{Optimism Estimate} \\ &= 0.2228674 - 0.0184793 \\ &= 0.2043881 \\ \end{aligned} \]

Data

Since reproducing TBC’s models is a goal of this project, I want the exact data used by the authors. Because the NLSY79 is a public dataset, this is possible. However, the NLSY79 uses a special variable coding system that was not documented in TBC. To the authors’ credit, they released their modified datasets online along with a data dictionary12. To reproduce TBC’s results as closely as possible, I opted to use the authors’ data (i.e., “Nation.txt”).

Training vs Holdout data

Another consideration is how the training and holdout datasets are defined. The authors’ data includes a variable called “Sample”. As described in the NLSY79 User’s Guide, this Sample variable has the following 3 values:

  1. A cross-sectional sample of 6,111 respondents designed to be representative of the noninstitutionalized civilian segment of young people living in the United States in 1979 and born between January 1, 1957, and December 31, 1964 (ages 14–21 as of December 31, 1978)
  2. A supplemental sample of 5,295 respondents designed to oversample civilian Hispanic, black, and economically disadvantaged non-black/non-Hispanic youth living in the United States during 1979 and born between January 1, 1957, and December 31, 1964
  3. A sample of 1,280 respondents designed to represent the population born between January 1, 1957, and December 31, 1961 (ages 17–21 as of December 31, 1978), and who were enlisted in one of the four branches of the military as of September 30, 1978

For a given TBC model, the training data is based on the cross-sectional partition (exactly what TBC’s authors use). The holdout data is based on the supplemental partition. This choice of holdout dataset is motivated by Dr. Krenz’s classification analysis, where she used the supplemental sample as a holdout.
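Assuming the Sample variable in “Nation.txt” is coded 1–3 in the order listed above (my assumption; the actual coding should be checked against the data dictionary), the partition is a simple filter. The data.table below is a toy stand-in:

```r
library(data.table)

# Toy stand-in for the authors' "Nation.txt"; the Sample coding
# (1 = cross-sectional, 2 = supplemental, 3 = military) is an assumption
# based on the ordering in the NLSY79 User's Guide.
DT.nation <- data.table(ID = 1:6, Sample = c(1, 2, 1, 3, 2, 1))

DT.training <- DT.nation[Sample == 1]  # cross-sectional sample
DT.holdout  <- DT.nation[Sample == 2]  # supplemental sample
```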

Data and Model Specifications

Also needed for reproduction are the exact model specifications. The authors listed parameter and diagnostic output for each model in appendix 4 of TBC. I used these to experiment with the “Nation.txt” dataset in order to reproduce TBC’s models as closely as possible. 23 of the 24 reproduced models agreed with the book’s listed parameter values to 4 or more decimal places. The 24th model reproduced the parameter values to similar precision, but its categorical variable had different values. I suspect these differences come from the authors’ use of an older version of STATA versus my use of R. (Perhaps the two modeling implementations handle categorical levels differently in the design matrix, causing different parameter values.)

Below I list the names I give for each TBC model including their:

  • General category (used later to group the models)
  • Description of the binary target (mostly taken from TBC)
  • Number of NLSY79 Participants for each data partition
  • Frequency of Binary Target for each data partition
## Note: The following variables are not present in Nation.txt. I created these to facilitate reproducing TBC's models. See TBC_Bootstrap for details.
# DONT RUN
#   GEDvHSGr_Ind = ifelse(GEDvHSGr == 'GED', 1, 0)
#   LTHSvHS_Ind = ifelse(LTHSvHS == 'LTHS', 1, 0)
#   WedBy30_Ind = ifelse(WedBy30 == '1 Yes', 1, 0)
# DONT RUN

kable(
  outputs[tbc.models[, .(label, labelAppendix)], on = 'label'][
    , .(
      `Model` = labelAppendix
      , Category = category
      , `Description of Binary Target` = target.description
      , `(Training)` = nobs.training
      , `(Holdout)` = nobs.holdout
      , `(Training)` = floor(response.mean.training * 1e3)/1e3
      , `(Holdout)` = floor(response.mean.holdout * 1e3)/1e3
    )
  ]
  , caption = 'Table of TBC model properties'
  , align = c('l', 'c', 'l', rep('c', 4))
) %>%
  kable_styling(
    bootstrap_options = "striped"
    , full_width = F
  ) %>%
  add_header_above(c(' ' = 3, 'NLSY79 Participants' = 2, 'Frequency of Binary Target' = 2))
Table of TBC model properties
NLSY79 Participants
Frequency of Binary Target
Model Category Description of Binary Target (Training) (Holdout) (Training) (Holdout)
Poverty Poverty Under the official poverty line in 1989 3367 1067 0.072 0.147
Poverty (Highschool Subset) Poverty Under the official poverty line in 1989 1236 309 0.080 0.097
Dropout Education Permanently dropped out of high school 3572 1090 0.101 0.272
Dropout (With Interaction) Education Permanently dropped out of high school 3572 1090 0.101 0.272
Get GED Education Received a GED instead of a high school diploma 3494 941 0.081 0.157
Get Bachelors Education Received a bachelor’s degree 3857 1238 0.253 0.212
Out of the Labor Force Employment Out of the labor force for four weeks or more in 1989 1686 488 0.102 0.168
Out of the Labor Force (Highschool Subset) Employment Out of the labor force for four weeks or more in 1989 621 126 0.072 0.142
Out of the Labor Force (College Subset) Employment Out of the labor force for four weeks or more in 1989 268 58 0.063 0.051
Unemployed Employment Unemployed for four weeks or more in 1989 1397 370 0.071 0.108
Unemployed (Highschool Subset) Employment Unemployed for four weeks or more in 1989 537 99 0.074 0.080
Unemployed (College Subset) Employment Unemployed for four weeks or more in 1989 228 53 0.048 0.037
Ever Married Marriage Ever married before the age of 30 1634 670 0.787 0.788
Ever Married (Highschool Subset) Marriage Ever married before the age of 30 605 179 0.839 0.832
Ever Married (College Subset) Marriage Ever married before the age of 30 237 130 0.696 0.769
Divorced in 1st 5 Years of Marriage Marriage Divorced within the first 5 years of marriage 2031 745 0.198 0.217
Divorced in 1st 5 Years of Marriage (Highschool Subset) Marriage Divorced within the first 5 years of marriage 875 252 0.195 0.253
Divorced in 1st 5 Years of Marriage (College Subset) Marriage Divorced within the first 5 years of marriage 209 93 0.071 0.086
Divorced in 1st 5 Years of Marriage (With Parents Factors) Marriage Divorced within the first 5 years of marriage 2029 745 0.198 0.217
Surveyed in Jail Crime Subjects interviewed in jail at least once from 1979 to 1990 1945 554 0.027 0.084
Surveyed in Jail (Highschool Subset) Crime Subjects interviewed in jail at least once from 1979 to 1990 716 143 0.011 0.034
“Middle Class Values” “Middle Class Values” Subjects scored ‘yes’ on the Middle Class Values Index 3029 1064 0.508 0.339
“Middle Class Values” (Highschool Subset) “Middle Class Values” Subjects scored ‘yes’ on the Middle Class Values Index 1162 303 0.593 0.498
“Middle Class Values” (College Subset) “Middle Class Values” Subjects scored ‘yes’ on the Middle Class Values Index 402 114 0.796 0.771

Methods

The statistical methods of this report are described in the figure below. Since this procedure is non-trivial, I simplify it as the following:

For 10,000 bootstrap iterations and each TBC model:

  1. Define two bootstrap samples:

    • Bootstrapped training data

    • Bootstrapped holdout data

  2. Build a bootstrapped probability model using the bootstrapped training data

  3. With this model, score the:

    • Bootstrapped training sample

    • Bootstrapped holdout sample

    • Original training sample

    • Original holdout sample

  4. For each scored data set, build 500 classifications using cutoff probabilities in a linear grid across (0, 1)

  5. For each classifier, calculate the Matthews correlation coefficient (MCC)

  6. For each cutoff value, calculate:

    • The optimism for MCC

    • Bias-corrected accelerated (BCa) 95% confidence intervals for training data

    • Bootstrapped percentile 95% confidence intervals for holdout data

  7. Lastly, for each model, find the cutoff value with the largest mean bootstrapped MCC value, i.e., the “optimal cutoff”.

Note: MCC data related to the holdout sample use bootstrapped percentile confidence intervals because the fitted probability models were based on training data. Hence, BCa intervals aren’t meaningful for MCC holdout data.
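The core of this procedure (steps 1 through 6, at a single cutoff instead of the 500-point grid) can be sketched end-to-end on simulated data. Everything below (the toy data, B = 200 iterations, the fixed 0.15 cutoff) is my own scaled-down illustration, not the report's TBC_Bootstrap.R code:

```r
set.seed(42)

# MCC from actual outcomes and 0/1 classifications
mcc <- function(actual, classified) {
  TP <- sum(classified == 1 & actual == 1)
  TN <- sum(classified == 0 & actual == 0)
  FP <- sum(classified == 1 & actual == 0)
  FN <- sum(classified == 0 & actual == 1)
  denom <- sqrt(as.numeric(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  if (denom == 0) NA_real_ else ((TP * TN) - (FP * FN)) / denom
}

# Toy analogue of a TBC model: one standardized predictor, rare binary target
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.5 - 0.8 * x))
DT <- data.frame(x, y)

cutoff <- 0.15
fit <- glm(y ~ x, family = binomial, data = DT)
mcc.apparent <- mcc(DT$y, as.integer(fitted(fit) > cutoff))

# Steps 1-6 for the training data, repeated B times
B <- 200
optimism <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)                    # 1. bootstrap sample
  fit.b <- glm(y ~ x, family = binomial, data = DT[idx, ])   # 2. refit the model
  p.boot <- fitted(fit.b)                                    # 3. score boot sample...
  p.orig <- predict(fit.b, newdata = DT, type = 'response')  #    ...and original data
  mcc(DT$y[idx], as.integer(p.boot > cutoff)) -              # 5-6. MCC difference
    mcc(DT$y, as.integer(p.orig > cutoff))
})

# Average difference = optimism estimate; subtract from the apparent value
mcc.corrected <- mcc.apparent - mean(optimism, na.rm = TRUE)
c(apparent = mcc.apparent, corrected = mcc.corrected)
```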

# For LaTeX code used to generate this png, see:
# https://github.com/wfrierson/predictions_in_the_bell_curve/blob/master/200%20Code/Bootstrap%20Validation%20Algorithm.tex
knitr::include_graphics(file.path(code.folder, 'Bootstrap Validation Algorithm.png'))
Procedure used to produce distributions of MCC values with optimism correction

These steps are implemented in R using the following scripts within this repository:

  • 000_Misc.R: Attaches needed packages and defines convenience functions

  • 100_Resampling.R: Custom functions to perform resampling techniques via bootstrapping and the jackknife method

  • 200_Classification.R: Custom functions to create classifications from modeled probabilities and then quantify their performance

  • 300_Visualizations.R: Custom functions to create various visualizations to understand the predictive performance of binary classifications

  • 400_Model_Definitions.R: Parameters used to define and reproduce logistic regression models from The Bell Curve

  • TBC_Bootstrap.R: Recursive code that executes a batch run for each TBC probability model. This script is the workhorse of the entire project. See its code for details.

Note: When 10,000 bootstrap iterations are used on all 24 models in the prior script (with parallel computing), the entire recursive batch process:

  • Produces 193 files in the “100 Data” folder, 6.56 MB in total
  • Takes 4–5 hours

Results

In this section, I present two types of results. First, I walk through output from the poverty model; the discussion is intended to help readers understand all the model output displayed in the appendix (if that detail is desired). Second, I summarize the optimal MCC values evaluated on the training and holdout datasets for each model.

Walkthrough of Poverty Model

Model Parameters

Since reproducibility is critical for this project, I display the model parameters associated with the original, non-bootstrapped poverty model. These values should correspond to the ones listed on P. 620 of TBC. (See the appendix on reproducing TBC models for details.)

model <- 'poverty'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Poverty

Dependent variable: Pov89

  zAFQT89            -0.83767*** (0.09351)
  zSES               -0.33008*** (0.09010)
  zAge               -0.02384    (0.07237)
  Constant           -2.64877*** (0.07688)

  Observations          3,367
  Log Likelihood     -784.40180
  Akaike Inf. Crit.  1,576.80400

Note: *p<0.1; **p<0.05; ***p<0.01

Plotting Modeled Probabilities

The three plots below show the average of the poverty target and the modeled probabilities across a banded version of the three main variables: zAFQT89, zSES, and zAge.

    The black data points represent averages of the poverty target for NLSY79 participants (i.e., observed data). The blue line represents the averaged modeled probabilities calculated from the original training data. The light blue ribbon represents the 95% (BCa) confidence intervals for the modeled probability. The bar chart above the main plot reflects the proportion of NLSY79 participants included in each bucket of the x-axis. (This helps indicate which buckets carry more or less weight.)
    Both zAFQT89 and zSES tend to have a monotonically decreasing relationship with the poverty target. The plot for zAFQT89 is the least noisy of the three, and that for zAge is the most noisy. By inspection, the modeled probability lines show that zAFQT89 and zSES influence the poverty target but do not capture a significant amount of the variance in the data. This is also reflected in the 95% confidence intervals for zAFQT89, since some data points (even some with large weights, according to the bar chart) lie outside the intervals.

zAFQT89

model <- 'poverty'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed and bootstrapped modeled probabilities across zAFQT89 for Poverty model (Training Data)

zSES

model <- 'poverty'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
     , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed and bootstrapped modeled probabilities across zSES for Poverty model (Training Data)

zAge

model <- 'poverty'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
     , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed and bootstrapped modeled probabilities across zAge for Poverty model (Training Data)

Plotting MCC Distributions

The two plots below show the bootstrapped distributions of MCC values across a dense grid of cutoff values. The first plot shows the MCC distribution evaluated on the training data; the second, the distribution evaluated on the holdout data.

    The solid blue line represents the MCC values calculated from the original data (hence its jagged appearance). The dashed blue line represents the mean bootstrapped MCC values. The light blue ribbon represents the 95% confidence intervals for MCC values. The bar chart above the main plot represents the proportion of non-NULL MCC values calculated for that cutoff value. (MCC is undefined when its denominator is 0, which can happen, e.g., when no observations are classified as positive.) Lastly, the black vertical dashed line identifies the optimal cutoff probability within that dataset. The adjacent table shows the optimal cutoff probabilities and MCC values (split into the mean bootstrapped value, the apparent value from the original data, and the optimism-corrected value), including their 95% confidence intervals.
model <- 'poverty'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
  , caption = 'Table of optimal MCC values with bootstrapped confidence intervals'
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Table of optimal MCC values with bootstrapped confidence intervals

| Dataset  | Cutoff Probability | Lower Bound | Apparent MCC | Bootstrapped Mean MCC | Optimism-Corrected MCC | Upper Bound |
|----------|:------------------:|:-----------:|:------------:|:---------------------:|:----------------------:|:-----------:|
| Training | 0.1603206          | 0.1741112   | 0.2310828    | 0.2278948             | 0.2283286              | 0.2829347   |
| Holdout  | 0.1683367          | 0.2601890   | 0.3483250    | 0.3382416             | 0.3477546              | 0.4139740   |

Training Data

model <- 'poverty'

plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)


Holdout Data

model <- 'poverty'

plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)


Plotting Modeled Classifications

The two plots below show, against a banded version of zAFQT89, the averages of the poverty target, the modeled probabilities, and the classifications made using the cutoff associated with the optimal MCC value from the training data. The first plot shows results for the training data; the second shows results for the holdout data.

    The black data points represent averages of the poverty target for NLSY79 participants (same as before). The blue line represents the averaged modeled classifications calculated from the original data (and using the optimal cutoff of 16.03%). The light blue ribbon represents 95% confidence intervals for the modeled classifications. The grey line represents the averaged modeled probabilities calculated from the original data. The light grey ribbon represents 95% confidence intervals for the modeled probability. The bar chart above the main plot reflects the proportion of NLSY79 participants included for each bucket of the x-axis (same as before).
    Overall, the averaged classifications poorly reflect the observed values in both datasets, consistent with their optimal MCC values below 35%. If the model performed very well, the two confidence intervals would overlap across most of the zAFQT89 buckets; for these two plots, they rapidly diverge for zAFQT89 values below -1.

Training Data

plotFrequencyVsClassification.bootCI(
    DT.factors = inputs[['poverty']][['DT.model.training']]
    , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[['poverty']][['zAFQT89']]
    , DT.probabilities.bootCI = DTs.probability.bootCI.training[['poverty']][['zAFQT89']]
    , xString = 'zAFQT89'
    , yString = outputs[label == 'poverty', response]
    , cutoff = outputs[label == 'poverty', cutoff.mcc.training]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = outputs[label == 'poverty', target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities for Poverty model (Training Data)


Holdout Data

plotFrequencyVsClassification.bootCI(
    DT.factors = inputs[['poverty']][['DT.model.holdout']]
    , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[['poverty']][['zAFQT89']]
    , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[['poverty']][['zAFQT89']]
    , xString = 'zAFQT89'
    , yString = outputs[label == 'poverty', response]
    , cutoff = outputs[label == 'poverty', cutoff.mcc.training]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = outputs[label == 'poverty', target.descriptions.formatted]
)
Comparing bootstrapped classifications and modeled probabilities for Poverty model (Holdout Data)


Comparison of Optimal MCC Values

The main output of this project is the following two plots, which compare the optimal MCC values between the training and holdout datasets for each model. The first plot includes models whose average bootstrapped p-value for zAFQT89 is under 5% (i.e., models where IQ tends to be statistically significant); the second includes models with that average p-value above 5% (i.e., models where IQ tends not to be statistically significant). For each model, 95% confidence intervals are shown for both training and holdout data.

Note: I’m not aware of statisticians averaging bootstrapped p-values of individual coefficients. Averaging p-values is used here only to group models according to their overall noise. As shown in the plots, models with this average p-value above 5% tend to have very large 95% confidence intervals – so large that they clutter the combined plot!
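For concreteness, the grouping itself is just a threshold on the stored p-values; a toy version (with fabricated model labels and values, standing in for the report's actual DT.coef.pValue table) looks like:

```r
# Fabricated example of splitting model labels by the mean
# bootstrapped p-value of the zAFQT89 coefficient
pValue.example <- data.frame(
  label = c('poverty', 'ged', 'wed.college')
  , zAFQT89 = c(0.001, 0.032, 0.41)
)
labels.significant <- pValue.example$label[pValue.example$zAFQT89 <= 0.05]
labels.noisy       <- pValue.example$label[pValue.example$zAFQT89 > 0.05]
```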

    The x-axis represents the optimal MCC values as evaluated on the training data; the y-axis represents those evaluated on the holdout data. To simplify the plot, the models are grouped together by category. With the exception of a few education-related TBC models, all other TBC models have optimal MCC values below 50% on both the training and holdout data.

Statistically Significant zAFQT89 (5% level)

ggplot(
  outputs[
    , .(
      label
      , mcc.optimism.corrected.training
      , mcc.bootstrap.lower.BCa.training
      , mcc.bootstrap.upper.BCa.training
      , mcc.optimism.corrected.holdout
      , mcc.bootstrap.lower.bootPercentile.holdout
      , mcc.bootstrap.upper.bootPercentile.holdout
      , Category = category
    )
  ][
    # Only plot models where the p-value for zAFQT89 is less than 0.05.
    # The other models could also be plotted, but their BCa intervals don't contain the sample MCC value.
    # Their larger p-values indicate a very weak model.
    label %in% DT.coef.pValue[zAFQT89 <= 0.05]$label
  ]
  , aes(
    x = mcc.optimism.corrected.training
    , y = mcc.optimism.corrected.holdout
    , group = Category
    , color = Category
    , shape = Category
  )
) + 
  geom_point(size = 4, stroke = 1.5) + 
  geom_errorbar(
    aes(
      ymin = mcc.bootstrap.lower.bootPercentile.holdout
      , ymax = mcc.bootstrap.upper.bootPercentile.holdout
    )
    , color = 'black'
    , alpha = 0.5
  ) +
  geom_errorbarh(
    aes(
      xmin = mcc.bootstrap.lower.BCa.training
      , xmax = mcc.bootstrap.upper.BCa.training
    )
    , color = 'black'
    , alpha = 0.35
  ) +
  scale_color_brewer(palette = "Dark2") +
  xlab('Optimal MCC Value (Training Data)') + 
  ylab('Optimal MCC Value (Holdout Data)') + 
  lims(x = c(-0.2,1), y = c(-0.2,1)) + 
  theme_bw() + 
  theme(
    axis.text=element_text(size=16)
    , axis.title=element_text(size=16)
    , legend.text=element_text(size=16)
    , legend.title=element_text(size=16)
  )
Comparing predictive power of TBC models via optimism corrected MCC values


Not Statistically Significant zAFQT89 (5% level)

ggplot(
  outputs[
    , .(
      label
      , mcc.optimism.corrected.training
      , mcc.bootstrap.lower.BCa.training
      , mcc.bootstrap.upper.BCa.training
      , mcc.optimism.corrected.holdout
      , mcc.bootstrap.lower.bootPercentile.holdout
      , mcc.bootstrap.upper.bootPercentile.holdout
      , Category = category
    )
  ][
    label %in% DT.coef.pValue[zAFQT89 > 0.05]$label
  ]
  , aes(
    x = mcc.optimism.corrected.training
    , y = mcc.optimism.corrected.holdout
    , group = Category
    , color = Category
    , shape = Category
  )
) + 
  geom_point(size = 4, stroke = 1.5) + 
  geom_errorbar(
    aes(
      ymin = mcc.bootstrap.lower.bootPercentile.holdout
      , ymax = mcc.bootstrap.upper.bootPercentile.holdout
    )
    , color = 'black'
    , alpha = 0.5
  ) +
  geom_errorbarh(
    aes(
      xmin = mcc.bootstrap.lower.BCa.training
      , xmax = mcc.bootstrap.upper.BCa.training
    )
    , color = 'black'
    , alpha = 0.35
  ) +
  scale_color_brewer(palette = "Dark2") +
  xlab('Optimal MCC Value (Training Data)') + 
  ylab('Optimal MCC Value (Holdout Data)') + 
  lims(x = c(-0.35,1), y = c(-0.35,1)) + 
  theme_bw() + 
  theme(
    axis.text=element_text(size=16)
    , axis.title=element_text(size=16)
    , legend.text=element_text(size=16)
    , legend.title=element_text(size=16)
  )
Comparing predictive power of TBC models via optimism corrected MCC values


Discussion

What does the plot comparing optimal MCC values mean? Using the guidelines for interpreting MCC values listed above, nearly all TBC models have a low correlation (i.e., below 50%) between their observed binary target and the bootstrapped classification when evaluated on either training or holdout data. This suggests TBC’s probability models are not materially predictive and so they inadequately support the authors’ public policy recommendations.

    The only TBC models with optimal MCC values near 50% are those with binary targets related to education: receiving a GED, receiving a Bachelor’s, or dropping out of school. Although I am not a social psychologist, these results seem plausible: lower IQ having a stronger association with academic difficulties. Again using the above guidelines for interpreting MCC values, the optimal MCC values for these models are “moderate.” While these models are relatively more predictive than the other TBC models, they alone do not warrant the authors’ recommendations.

Informal Notes

  1. Much more can be said about these models, but that would lengthen this already long report!

  2. Future work could involve the remaining ~12 TBC models, but I don’t expect to have time to do so.

  3. I’m aware bootstrap percentile confidence intervals may not be reliable. However, I’ve included them for holdout data as a soft comparison with the results from the training data.

  4. The TBC_Bootstrap procedure technically doesn’t set a random seed for the boot package validation work. In a future version, I could do this and rerun the procedure. However, I don’t think it would change any conclusions of this report.

  5. I chose 10,000 bootstrap iterations to be deliberately generous. Preliminary work with the poverty model (not included here) suggested that even 5,000 iterations was adequate, but I wanted the count large enough that a reader wouldn’t discredit the conclusions for having too few bootstrap iterations.
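For readers unfamiliar with the bootstrap optimism correction used throughout, here is a minimal self-contained sketch of the procedure (Efron-style), using classification accuracy in place of MCC and simulated data for brevity; all names and the iteration count are illustrative:

```r
set.seed(42)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + d$x))

# Performance of a fitted model evaluated on an arbitrary dataset
metric <- function(fit, data) {
  p <- predict(fit, newdata = data, type = 'response')
  mean((p >= 0.5) == data$y)
}

fit.full <- glm(y ~ x, family = binomial, data = d)
apparent <- metric(fit.full, d)  # performance on the data used to fit

# Optimism: how much each bootstrap model flatters itself on its own
# bootstrap sample relative to its performance on the original data
optimism <- replicate(200, {
  d.boot <- d[sample(n, replace = TRUE), ]
  fit.boot <- glm(y ~ x, family = binomial, data = d.boot)
  metric(fit.boot, d.boot) - metric(fit.boot, d)
})

# Subtracting the mean optimism from the apparent performance gives
# the optimism-corrected estimate
corrected <- apparent - mean(optimism)
```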

Appendix

Details on Reproducing TBC Models

The 24 TBC models can be reproduced in R with the following formulae and subset definitions for the training data. (The holdout data has the same subset definition per model but with Sample == 'Supplement'.) These reproductions can be confirmed by comparing the regression tables in the Model Output section of this paper with the diagnostic data shown on the TBC appendix pages cited below.
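As a sketch of the call structure shared by all 24 models: filter the data to the model's subset, then fit a logistic regression with glm. The example below uses simulated data in place of the NLSY79 extract (the column names match the table below, but the values and coefficients are fabricated for illustration):

```r
set.seed(1)
n <- 500
DT <- data.frame(
  zAFQT89 = rnorm(n)
  , zSES = rnorm(n)
  , zAge = rnorm(n)
)
# Fabricated target with an assumed negative IQ effect
DT$Pov89 <- rbinom(n, 1, plogis(-2.6 - 0.8 * DT$zAFQT89 - 0.3 * DT$zSES))

# Same call structure as the Poverty model in the first row below
# (the real call would first subset the NLSY79 data as listed there)
fit <- glm(Pov89 ~ zAFQT89 + zSES + zAge, family = binomial, data = DT)
coef(fit)
```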

kable(
  outputs[tbc.models[, .(label, labelAppendix)], on = 'label'][
    , .(
      `Model` = labelAppendix
      , `R Code to Specify Subset of Training Data` = filter.training
      , `Model Formula` = formula
      , `TBC Appendix Page for Same Model` = appendix.page
    )
  ]
  , align = c(rep('l', 3), 'c')
) %>% kable_styling(
  bootstrap_options = "striped"
  , full_width = F
)
| Model | R Code to Specify Subset of Training Data | Model Formula | TBC Appendix Page for Same Model |
|:---|:---|:---|:---:|
| Poverty | `Race4 == 'White' & Sample == 'XSection' & EmpSchl == 1` | `Pov89 ~ zAFQT89 + zSES + zAge` | 620 |
| Poverty (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & EmpSchl == 1 & EdSample == 'HS'` | `Pov89 ~ zAFQT89 + zSES + zAge` | 620 |
| Dropout | `Race4 == 'White' & Sample == 'XSection' & LTHSvGED != 'GED' & !is.na(AllvBA)` | `LTHSvHS_Ind ~ zAFQT89 + zSES + zAge` | 620 |
| Dropout (With Interaction) | `Race4 == 'White' & Sample == 'XSection' & LTHSvGED != 'GED' & !is.na(AllvBA)` | `LTHSvHS_Ind ~ zAFQT89 + zSES + zAge + zAFQT89:zSES` | 621 |
| Get GED | `Race4 == 'White' & Sample == 'XSection' & LTHSvGED != 'LTHS' & !is.na(AllvBA)` | `GEDvHSGr_Ind ~ zAFQT89 + zSES + zAge` | 622 |
| Get Bachelors | `Race4 == 'White' & Sample == 'XSection' & !(BA_Atta == 'InSchl/NoD' & AllvBA == 0 & is.na(AllvBA))` | `AllvBA ~ zAFQT89 + zSES + zAge` | 622 |
| Out of the Labor Force | `Race4 == 'White' & Sample == 'XSection' & !is.na(Emp568) & Sex == 'Male'` | `MoOLF89 ~ zAFQT89 + zSES + zAge` | 623 |
| Out of the Labor Force (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Emp568) & Sex == 'Male' & EdSample == 'HS'` | `MoOLF89 ~ zAFQT89 + zSES + zAge` | 623 |
| Out of the Labor Force (College Subset) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Emp568) & Sex == 'Male' & EdSample == 'College'` | `MoOLF89 ~ zAFQT89 + zSES + zAge` | 624 |
| Unemployed | `Race4 == 'White' & Sample == 'XSection' & !is.na(Emp568) & Sex == 'Male' & !is.na(UnempSam)` | `MoUnemp8 ~ zAFQT89 + zSES + zAge` | 624 |
| Unemployed (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Emp568) & Sex == 'Male' & !is.na(UnempSam) & EdSample == 'HS'` | `MoUnemp8 ~ zAFQT89 + zSES + zAge` | 624 |
| Unemployed (College Subset) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Emp568) & Sex == 'Male' & !is.na(UnempSam) & EdSample == 'College'` | `MoUnemp8 ~ zAFQT89 + zSES + zAge` | 625 |
| Ever Married | `Race4 == 'White' & Sample == 'XSection' & IntAge90 >= 30 & !is.na(EverWed)` | `WedBy30_Ind ~ zAFQT89 + zSES + zAge` | 625 |
| Ever Married (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & IntAge90 >= 30 & !is.na(EverWed) & EdSample == 'HS'` | `WedBy30_Ind ~ zAFQT89 + zSES + zAge` | 626 |
| Ever Married (College Subset) | `Race4 == 'White' & Sample == 'XSection' & IntAge90 >= 30 & !is.na(EverWed) & EdSample == 'College'` | `WedBy30_Ind ~ zAFQT89 + zSES + zAge` | 626 |
| Divorced in 1st 5 Years of Marriage | `Race4 == 'White' & Sample == 'XSection' & !is.na(Div5Yrs)` | `Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate` | 626 |
| Divorced in 1st 5 Years of Marriage (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Div5Yrs) & EdSample == 'HS'` | `Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate` | 627 |
| Divorced in 1st 5 Years of Marriage (College Subset) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Div5Yrs) & EdSample == 'College'` | `Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate` | 627 |
| Divorced in 1st 5 Years of Marriage (With Parents Factors) | `Race4 == 'White' & Sample == 'XSection' & !is.na(Div5Yrs) & Adult14S != ''` | `Div5Yrs ~ zAFQT89 + zSES + zAge + Adult14S` | 627 |
| Surveyed in Jail | `Race4 == 'White' & Sample == 'XSection' & Sex == 'Male'` | `Jail ~ zAFQT89 + zSES + zAge` | 645 |
| Surveyed in Jail (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & Sex == 'Male' & EdSample == 'HS'` | `Jail ~ zAFQT89 + zSES + zAge` | 645 |
| "Middle Class Values" | `Race4 == 'White' & Sample == 'XSection'` | `MCV_Inde ~ zAFQT89 + zSES + zAge` | 646 |
| "Middle Class Values" (Highschool Subset) | `Race4 == 'White' & Sample == 'XSection' & EdSample == 'HS'` | `MCV_Inde ~ zAFQT89 + zSES + zAge` | 646 |
| "Middle Class Values" (College Subset) | `Race4 == 'White' & Sample == 'XSection' & EdSample == 'College'` | `MCV_Inde ~ zAFQT89 + zSES + zAge` | 647 |

Table of Optimal MCC Values

Training Data

kable(
  outputs[tbc.models[, .(label, labelAppendix)], on = 'label'][
    order(-mcc.optimism.corrected.training)
    , .(
      Model = labelAppendix
      , Category = category
      , `Cutoff Probability` = cutoff.mcc.training
      , `Lower Bound` = mcc.bootstrap.lower.BCa.training
      , `Apparent` = mcc.apparent.training
      , `Bootstrapped Mean` = mcc.bootstrap.training
      , `Optimism Corrected` = mcc.optimism.corrected.training
      , `Upper Bound` = mcc.bootstrap.upper.BCa.training
    )
  ]
  , align = rep('c', 8)
  , caption = 'Table of Optimal MCC values by TBC Model (Training Data)'
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 4, 'Optimal MCC' = 3, ' ' = 1))
Table of Optimal MCC values by TBC Model (Training Data)

| Model | Category | Cutoff Probability | Lower Bound | Apparent | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Get Bachelors | Education | 0.2665331 | 0.5255255 | 0.5546097 | 0.5551713 | 0.5535824 | 0.5811365 |
| Dropout | Education | 0.2044088 | 0.4533255 | 0.4975504 | 0.4954082 | 0.4961282 | 0.5405721 |
| Dropout (With Interaction) | Education | 0.2184369 | 0.4341434 | 0.4885867 | 0.4939417 | 0.4868035 | 0.5241504 |
| Out of the Labor Force (College Subset) | Employment | 0.2024048 | -0.0232670 | 0.3526263 | 0.3039617 | 0.2963358 | 0.5970251 |
| "Middle Class Values" | "Middle Class Values" | 0.4348697 | 0.2339730 | 0.2710309 | 0.2731777 | 0.2692391 | 0.3033458 |
| Poverty | Poverty | 0.1603206 | 0.1741112 | 0.2310828 | 0.2278948 | 0.2283286 | 0.2829347 |
| Surveyed in Jail (Highschool Subset) | Crime | 0.2244489 | -0.0104750 | 0.3518185 | 0.3445395 | 0.2199638 | 0.5765422 |
| Ever Married (Highschool Subset) | Marriage | 0.7054108 | 0.0942768 | 0.2312680 | 0.1916036 | 0.2092067 | 0.3823478 |
| Poverty (Highschool Subset) | Poverty | 0.0881764 | 0.1327124 | 0.1912472 | 0.1821463 | 0.1831268 | 0.2649095 |
| Get GED | Education | 0.0641283 | 0.1575783 | 0.1825203 | 0.1775072 | 0.1801284 | 0.2153326 |
| Divorced in 1st 5 Years of Marriage (College Subset) | Marriage | 0.0901804 | -0.0555556 | 0.2374482 | 0.2598149 | 0.1755364 | 0.3716570 |
| Surveyed in Jail | Crime | 0.0260521 | 0.1043961 | 0.1638284 | 0.1611350 | 0.1565315 | 0.2063444 |
| "Middle Class Values" (College Subset) | "Middle Class Values" | 0.7374749 | -0.0541101 | 0.1543930 | 0.1641656 | 0.1206757 | 0.2655808 |
| Unemployed | Employment | 0.0981964 | 0.0579804 | 0.1268596 | 0.1159671 | 0.1173604 | 0.2025520 |
| Out of the Labor Force (Highschool Subset) | Employment | 0.0921844 | 0.0571262 | 0.1365957 | 0.1147173 | 0.1127359 | 0.2549812 |
| Divorced in 1st 5 Years of Marriage | Marriage | 0.1903808 | 0.0691469 | 0.1197176 | 0.1241194 | 0.1122616 | 0.1623367 |
| Unemployed (Highschool Subset) | Employment | 0.1803607 | -0.0194870 | 0.1522530 | 0.1113602 | 0.1112092 | 0.3195728 |
| Out of the Labor Force | Employment | 0.1062124 | 0.0387634 | 0.1009407 | 0.1035068 | 0.0905966 | 0.1489757 |
| Divorced in 1st 5 Years of Marriage (With Parents Factors) | Marriage | 0.1943888 | 0.0151819 | 0.0909336 | 0.1080654 | 0.0773499 | 0.1202274 |
| "Middle Class Values" (Highschool Subset) | "Middle Class Values" | 0.5771543 | 0.0140331 | 0.0845855 | 0.0777591 | 0.0657706 | 0.1559252 |
| Ever Married | Marriage | 0.8116232 | 0.0147045 | 0.0770474 | 0.0735238 | 0.0653208 | 0.1256629 |
| Divorced in 1st 5 Years of Marriage (Highschool Subset) | Marriage | 0.2004008 | -0.1043035 | 0.0448236 | 0.0829293 | 0.0161811 | 0.0802507 |
| Ever Married (College Subset) | Marriage | 0.6693387 | -0.2084305 | -0.0098520 | 0.1014006 | -0.0863671 | 0.0447238 |
| Unemployed (College Subset) | Employment | 0.2625251 | -0.0659163 | -0.0149435 | 0.2189325 | -0.1023747 | 0.2766287 |

(Apparent, Bootstrapped Mean, and Optimism Corrected are the optimal MCC values.)

Holdout Data

kable(
  outputs[tbc.models[, .(label, labelAppendix)], on = 'label'][
    order(-mcc.optimism.corrected.holdout)
    , .(
      Model = labelAppendix
      , Category = category
      , `Cutoff Probability` = cutoff.mcc.holdout
      , `Lower Bound` = mcc.bootstrap.lower.bootPercentile.holdout
      , `Apparent` = mcc.apparent.holdout
      , `Bootstrapped Mean` = mcc.bootstrap.holdout
      , `Optimism Corrected` = mcc.optimism.corrected.holdout
      , `Upper Bound` = mcc.bootstrap.upper.bootPercentile.holdout
    )
  ]
  , align = rep('c', 8)
  , caption = 'Table of Optimal MCC values by TBC Model (Holdout Data)'
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 4, 'Optimal MCC' = 3, ' ' = 1))
Table of Optimal MCC values by TBC Model (Holdout Data)

| Model | Category | Cutoff Probability | Lower Bound | Apparent | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Get Bachelors | Education | 0.1503006 | 0.5964385 | 0.6521115 | 0.6459635 | 0.6525238 | 0.6939160 |
| Dropout | Education | 0.1643287 | 0.5530523 | 0.6050726 | 0.6014355 | 0.6050119 | 0.6486212 |
| Dropout (With Interaction) | Education | 0.1983968 | 0.5446363 | 0.6007592 | 0.5952430 | 0.6005778 | 0.6445841 |
| Unemployed (College Subset) | Employment | 0.2084168 | -0.0772503 | 0.4803922 | 0.2460685 | 0.4547918 | 0.8573214 |
| "Middle Class Values" | "Middle Class Values" | 0.4709419 | 0.3932675 | 0.4511110 | 0.4506696 | 0.4512280 | 0.5077860 |
| Poverty | Poverty | 0.1683367 | 0.2601890 | 0.3483250 | 0.3382416 | 0.3477546 | 0.4139740 |
| Poverty (Highschool Subset) | Poverty | 0.1863727 | 0.1285619 | 0.3355610 | 0.3117212 | 0.3369449 | 0.4782756 |
| Get GED | Education | 0.1162325 | 0.2220604 | 0.2904177 | 0.2858693 | 0.2902919 | 0.3481317 |
| Surveyed in Jail | Crime | 0.0380762 | 0.1237940 | 0.2362658 | 0.2136281 | 0.2365009 | 0.2980373 |
| Divorced in 1st 5 Years of Marriage (College Subset) | Marriage | 0.0721443 | -0.1124844 | 0.2285707 | 0.1731752 | 0.2320964 | 0.4331122 |
| Out of the Labor Force (Highschool Subset) | Employment | 0.1302605 | -0.0959208 | 0.2190890 | 0.0852661 | 0.2158204 | 0.3744309 |
| Out of the Labor Force | Employment | 0.1042084 | -0.0096561 | 0.1715436 | 0.1453647 | 0.1719140 | 0.2652515 |
| Unemployed | Employment | 0.0821643 | 0.0489113 | 0.1684439 | 0.1672231 | 0.1687170 | 0.2753726 |
| Divorced in 1st 5 Years of Marriage | Marriage | 0.1543086 | 0.0550943 | 0.1618226 | 0.1380496 | 0.1615906 | 0.2112502 |
| "Middle Class Values" (Highschool Subset) | "Middle Class Values" | 0.5430862 | -0.1051121 | 0.1361782 | 0.0346021 | 0.1350699 | 0.1646227 |
| Divorced in 1st 5 Years of Marriage (With Parents Factors) | Marriage | 0.1623246 | 0.0213149 | 0.1331039 | 0.1118409 | 0.1332435 | 0.1907022 |
| Divorced in 1st 5 Years of Marriage (Highschool Subset) | Marriage | 0.1523046 | -0.0941931 | 0.1155382 | 0.0713586 | 0.1158789 | 0.2201537 |
| Surveyed in Jail (Highschool Subset) | Crime | 0.0040080 | -0.1516913 | 0.0854826 | 0.0422934 | 0.0852632 | 0.1890059 |
| Ever Married | Marriage | 0.7695391 | -0.0430202 | 0.0630608 | 0.0466488 | 0.0623341 | 0.1385667 |
| Ever Married (Highschool Subset) | Marriage | 0.9438878 | -0.1825518 | 0.0336324 | 0.0312788 | 0.0341430 | 0.1174501 |
| Out of the Labor Force (College Subset) | Employment | 0.0020040 | -0.2745752 | 0.0309344 | 0.0248098 | 0.0290959 | 0.1792303 |
| Unemployed (Highschool Subset) | Employment | 0.1402806 | -0.0898027 | -0.0299510 | 0.0978638 | -0.0331602 | 0.4767900 |
| Ever Married (College Subset) | Marriage | 0.6432866 | -0.1408794 | -0.0482243 | 0.0646447 | -0.0496034 | 0.3081470 |
| "Middle Class Values" (College Subset) | "Middle Class Values" | 0.6773547 | -0.1091326 | -0.0511336 | 0.0724688 | -0.0597152 | 0.3261558 |

(Apparent, Bootstrapped Mean, and Optimism Corrected are the optimal MCC values.)

Model Output

cat(paste0('### ', tbc.models[label == 'poverty', labelAppendix]))

Poverty

cat('#### Model Parameters')

Model Parameters

model <- 'poverty'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Poverty

Dependent variable: Pov89

| Term | Coefficient (Std. Error) |
|:---|:---:|
| zAFQT89 | -0.83767\*\*\* (0.09351) |
| zSES | -0.33008\*\*\* (0.09010) |
| zAge | -0.02384 (0.07237) |
| Constant | -2.64877\*\*\* (0.07688) |
| Observations | 3,367 |
| Log Likelihood | -784.40180 |
| Akaike Inf. Crit. | 1,576.80400 |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
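To put the regression table above on a more interpretable scale (a quick sketch; the two values are taken directly from the table):

```r
# The zAFQT89 coefficient above is on the log-odds scale: a one
# standard deviation increase in AFQT score multiplies the odds of
# poverty by exp(-0.83767), holding zSES and zAge fixed.
odds.ratio <- exp(-0.83767)   # about 0.43

# The modeled probability of poverty at the mean of all predictors
# (all z-scores equal to 0) is the inverse logit of the intercept.
p.mean <- plogis(-2.64877)    # about 0.066
```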
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'poverty'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89


cat('##### zSES')
zSES
model <- 'poverty'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES


cat('##### zAge')
zAge
model <- 'poverty'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge


cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'poverty'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
| Dataset | Cutoff Probability | Lower Bound | Apparent | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Training | 0.1603206 | 0.1741112 | 0.2310828 | 0.2278948 | 0.2283286 | 0.2829347 |
| Holdout | 0.1683367 | 0.2601890 | 0.3483250 | 0.3382416 | 0.3477546 | 0.4139740 |

(Apparent, Bootstrapped Mean, and Optimism Corrected are the optimal MCC values.)
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)


cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)


cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)


cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)


cat(paste0('### ', tbc.models[label == 'poverty.hs', labelAppendix]))

Poverty (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'poverty.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Poverty (Highschool Subset)

Dependent variable: Pov89

| Term | Coefficient (Std. Error) |
|:---|:---:|
| zAFQT89 | -0.82673\*\*\* (0.16273) |
| zSES | -0.36197\*\* (0.14998) |
| zAge | 0.10492 (0.10946) |
| Constant | -2.72378\*\*\* (0.12902) |
| Observations | 1,236 |
| Log Likelihood | -325.26940 |
| Akaike Inf. Crit. | 658.53880 |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'poverty.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89


cat('##### zSES')
zSES
model <- 'poverty.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES


cat('##### zAge')
zAge
model <- 'poverty.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge


cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'poverty.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.0881764          | 0.1327124   | 0.1912472 | 0.1821463         | 0.1831268          | 0.2649095   |
| Holdout  | 0.1863727          | 0.1285619   | 0.3355610 | 0.3117212         | 0.3369449          | 0.4782756   |
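As a reader's aid, here is a minimal sketch of the bootstrap-optimism correction behind the "Optimism Corrected" column above. It is illustrative only, not the report's actual code (which is in R): the names `mcc`, `best_cutoff`, and `optimism_corrected_mcc` are hypothetical, and a simple cutoff-on-a-score classifier stands in for the fitted GLM.

```python
# Hedged sketch of Efron-style bootstrap optimism applied to MCC.
# All names here are illustrative; the report's analysis is done in R.
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the 2x2 confusion counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def classify(score, cutoff):
    """Turn modeled probabilities into 0/1 classes at a cutoff."""
    return (score >= cutoff).astype(int)

def best_cutoff(score, y, grid):
    """Cutoff in `grid` that maximizes MCC on (score, y)."""
    return max(grid, key=lambda c: mcc(y, classify(score, c)))

def optimism_corrected_mcc(score, y, n_boot=200, seed=1):
    """Apparent MCC minus the mean bootstrap optimism."""
    rng = np.random.default_rng(seed)
    grid = np.unique(np.quantile(score, np.linspace(0.05, 0.95, 19)))
    apparent = mcc(y, classify(score, best_cutoff(score, y, grid)))
    optimism = 0.0
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        cut_b = best_cutoff(score[idx], y[idx], grid)          # "fit" on bootstrap sample
        perf_boot = mcc(y[idx], classify(score[idx], cut_b))   # apparent MCC on that sample
        perf_orig = mcc(y, classify(score, cut_b))             # same rule on the original data
        optimism += perf_boot - perf_orig
    return apparent - optimism / n_boot
```

The correction subtracts the average amount by which a cutoff tuned on a bootstrap sample looks better on its own sample than on the original data, which is why the corrected values in these tables sit slightly below the apparent ones.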
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'dropout', labelAppendix]))

Dropout

cat('#### Model Parameters')

Model Parameters

model <- 'dropout'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Dropout

Dependent variable: LTHSvHS

| Term     | Coefficient | (Std. Error) |
|----------|-------------|--------------|
| zAFQT89  | -1.72296*** | (0.10281)    |
| zSES     | -0.64776*** | (0.08966)    |
| zAge     | 0.05696     | (0.06883)    |
| Constant | -2.85323*** | (0.09395)    |

Observations: 3,572
Log Likelihood: -779.99040
Akaike Inf. Crit.: 1,567.98100
Note: *p<0.1; **p<0.05; ***p<0.01
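To make the table concrete, here is a small sketch of how these reproduced coefficients map to a modeled probability via the standard logistic inverse link. The coefficient values are copied from the table above; the names `coef` and `dropout_probability` are illustrative, not from the report's code.

```python
# Sketch: turning the reproduced Dropout GLM coefficients into a probability.
# Coefficients are copied from the table above; names are hypothetical.
import math

coef = {
    'Constant': -2.85323,
    'zAFQT89': -1.72296,
    'zSES':    -0.64776,
    'zAge':     0.05696,
}

def dropout_probability(zAFQT89, zSES, zAge):
    """Inverse logit of the linear predictor."""
    eta = (coef['Constant'] + coef['zAFQT89'] * zAFQT89
           + coef['zSES'] * zSES + coef['zAge'] * zAge)
    return 1.0 / (1.0 + math.exp(-eta))

# A respondent at the mean of every factor (all z-scores 0):
p_avg = dropout_probability(0.0, 0.0, 0.0)  # ≈ 0.0545
```

Because the zAFQT89 coefficient is negative, a respondent one standard deviation above the mean in AFQT score has a lower modeled dropout probability than the average respondent, holding zSES and zAge fixed.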
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'dropout'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'dropout'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'dropout'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'dropout'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.2044088          | 0.4533255   | 0.4975504 | 0.4954082         | 0.4961282          | 0.5405721   |
| Holdout  | 0.1643287          | 0.5530523   | 0.6050726 | 0.6014355         | 0.6050119          | 0.6486212   |
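The holdout "Lower Bound" and "Upper Bound" columns in these tables come from bootstrap percentile intervals (the training bounds use the more involved BCa adjustment). A minimal sketch of the percentile version, with hypothetical names and the report's R replaced by Python for illustration:

```python
# Hedged sketch of a bootstrap percentile interval for MCC at a fixed cutoff.
# Names are illustrative; the report's analysis is done in R.
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the 2x2 confusion counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def mcc_percentile_ci(prob, y, cutoff, alpha=0.05, n_boot=1000, seed=1):
    """Resample (prob, y) pairs with replacement; take the alpha/2 quantiles of MCC."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        stats[b] = mcc(y[idx], (prob[idx] >= cutoff).astype(int))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

The width of the resulting interval is what the report uses to judge how unstable each model's optimal MCC is from sample to sample.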
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'dropout_interaction', labelAppendix]))

Dropout (With Interaction)

cat('#### Model Parameters')

Model Parameters

model <- 'dropout_interaction'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Dropout (With Interaction)

Dependent variable: LTHSvHS

| Term         | Coefficient | (Std. Error) |
|--------------|-------------|--------------|
| zAFQT89      | -1.89377*** | (0.11884)    |
| zSES         | -0.94024*** | (0.12505)    |
| zAge         | 0.05227     | (0.06827)    |
| zAFQT89:zSES | -0.41332*** | (0.11878)    |
| Constant     | -2.91432*** | (0.10293)    |

Observations: 3,572
Log Likelihood: -773.90060
Akaike Inf. Crit.: 1,557.80100
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'dropout_interaction'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'dropout_interaction'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'dropout_interaction'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'dropout_interaction'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.2184369          | 0.4341434   | 0.4885867 | 0.4939417         | 0.4868035          | 0.5241504   |
| Holdout  | 0.1983968          | 0.5446363   | 0.6007592 | 0.5952430         | 0.6005778          | 0.6445841   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'get_ged', labelAppendix]))

Get GED

cat('#### Model Parameters')

Model Parameters

model <- 'get_ged'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Get GED

Dependent variable: GEDvHSGr

| Term     | Coefficient | (Std. Error) |
|----------|-------------|--------------|
| zAFQT89  | -0.43253*** | (0.08512)    |
| zSES     | -0.60822*** | (0.08375)    |
| zAge     | -0.04164    | (0.06624)    |
| Constant | -2.35485*** | (0.06539)    |

Observations: 3,494
Log Likelihood: -915.28150
Akaike Inf. Crit.: 1,838.56300
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'get_ged'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'get_ged'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'get_ged'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'get_ged'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.0641283          | 0.1575783   | 0.1825203 | 0.1775072         | 0.1801284          | 0.2153326   |
| Holdout  | 0.1162325          | 0.2220604   | 0.2904177 | 0.2858693         | 0.2902919          | 0.3481317   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'get_bachelors', labelAppendix]))

Get Bachelors

cat('#### Model Parameters')

Model Parameters

model <- 'get_bachelors'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Get Bachelors

Dependent variable: AllvBA

| Term     | Coefficient | (Std. Error) |
|----------|-------------|--------------|
| zAFQT89  | 1.77473***  | (0.07825)    |
| zSES     | 1.01514***  | (0.06784)    |
| zAge     | -0.30316*** | (0.05113)    |
| Constant | -2.41463*** | (0.07783)    |

Observations: 3,857
Log Likelihood: -1,389.47300
Akaike Inf. Crit.: 2,786.94500
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'get_bachelors'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'get_bachelors'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'get_bachelors'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'get_bachelors'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.2665331          | 0.5255255   | 0.5546097 | 0.5551713         | 0.5535824          | 0.5811365   |
| Holdout  | 0.1503006          | 0.5964385   | 0.6521115 | 0.6459635         | 0.6525238          | 0.6939160   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'oolf4wks', labelAppendix]))

Out of the Labor Force

cat('#### Model Parameters')

Model Parameters

model <- 'oolf4wks'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Out of the Labor Force

Dependent variable: MoOLF89

| Term     | Coefficient | (Std. Error) |
|----------|-------------|--------------|
| zAFQT89  | -0.36247*** | (0.09928)    |
| zSES     | 0.21788**   | (0.10757)    |
| zAge     | -0.12815    | (0.08640)    |
| Constant | -2.20264*** | (0.08680)    |

Observations: 1,686
Log Likelihood: -548.25140
Akaike Inf. Crit.: 1,104.50300
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'oolf4wks'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'oolf4wks'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'oolf4wks'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'oolf4wks'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC ("Apparent", "Bootstrapped Mean", and "Optimism Corrected" columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.1062124          | 0.0387634   | 0.1009407 | 0.1035068         | 0.0905966          | 0.1489757   |
| Holdout  | 0.1042084          | -0.0096561  | 0.1715436 | 0.1453647         | 0.1719140          | 0.2652515   |
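Matthews correlation coefficient, the metric summarized in this table, is computed from the four cells of the confusion matrix. A minimal Python sketch of the definition (illustrative only; the report's analysis is in R):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal count is zero
    return (tp * tn - fp * fn) / denom

print(mcc(50, 50, 0, 0))    # perfect classifier -> 1.0
print(mcc(25, 25, 25, 25))  # no better than chance -> 0.0
```

Unlike accuracy, MCC stays near zero for a classifier that merely predicts the majority class, which matters here because the outcomes being modeled are rare.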
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'oolf4wks.hs', labelAppendix]))

Out of the Labor Force (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'oolf4wks.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Out of the Labor Force (Highschool Subset)

|                   | Dependent variable: MoOLF89 |
|-------------------|-----------------------------|
| zAFQT89           | -0.42151* (0.22643)         |
| zSES              | 0.56490** (0.23004)         |
| zAge              | -0.14557 (0.16725)          |
| Constant          | -2.69780*** (0.17674)       |
| Observations      | 621                         |
| Log Likelihood    | -156.98050                  |
| Akaike Inf. Crit. | 321.96090                   |

Note: *p<0.1; **p<0.05; ***p<0.01
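Because the predictors are standardized, each coefficient reads as a per-standard-deviation effect on the odds. An illustrative Python check using the zAFQT89 coefficient reported above (not the report's R code):

```python
import math

# Reported high-school-subset coefficient for zAFQT89 (log-odds scale).
beta_zAFQT89 = -0.42151

# A one-SD increase in AFQT multiplies the odds of the outcome by exp(beta),
# i.e. roughly a one-third reduction in the odds.
odds_ratio = math.exp(beta_zAFQT89)
print(round(odds_ratio, 3))  # 0.656
```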
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'oolf4wks.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'oolf4wks.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'oolf4wks.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'oolf4wks.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC ("Apparent", "Bootstrapped Mean", and "Optimism Corrected" columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.0921844          | 0.0571262   | 0.1365957 | 0.1147173         | 0.1127359          | 0.2549812   |
| Holdout  | 0.1302605          | -0.0959208  | 0.2190890 | 0.0852661         | 0.2158204          | 0.3744309   |
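The "Optimism Corrected" column follows the usual bootstrap-optimism recipe: refit on each resample, score the refitted model on both the resample and the original data, and subtract the average gap from the apparent score. A generic Python sketch, with hypothetical `fit` and `score` callables standing in for the report's R routines:

```python
import random
import statistics

def optimism_corrected(fit, score, data, n_boot=200, seed=42):
    """Bootstrap-optimism correction of a performance metric.
    fit(data) -> model; score(model, data) -> metric value."""
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    optimisms = []
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # resample with replacement
        m = fit(boot)
        # optimism = performance on the resample minus performance on the original data
        optimisms.append(score(m, boot) - score(m, data))
    return apparent - statistics.mean(optimisms)

# Degenerate check: a metric immune to overfitting gets zero correction.
corrected = optimism_corrected(lambda d: None, lambda m, d: 0.5, list(range(10)))
```

When the corrected value sits well below the apparent one (or the lower bound dips below zero, as in the holdout row here), the apparent MCC is overstating out-of-sample performance.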
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'oolf4wks.col', labelAppendix]))

Out of the Labor Force (College Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'oolf4wks.col'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Out of the Labor Force (College Subset)

|                   | Dependent variable: MoOLF89 |
|-------------------|-----------------------------|
| zAFQT89           | -0.84324* (0.45267)         |
| zSES              | 0.94515** (0.38754)         |
| zAge              | -0.46062 (0.29904)          |
| Constant          | -3.12957*** (0.60817)       |
| Observations      | 268                         |
| Log Likelihood    | -56.53686                   |
| Akaike Inf. Crit. | 121.07370                   |

Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'oolf4wks.col'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'oolf4wks.col'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'oolf4wks.col'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'oolf4wks.col'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC ("Apparent", "Bootstrapped Mean", and "Optimism Corrected" columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.2024048          | -0.0232670  | 0.3526263 | 0.3039617         | 0.2963358          | 0.5970251   |
| Holdout  | 0.0020040          | -0.2745752  | 0.0309344 | 0.0248098         | 0.0290959          | 0.1792303   |
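The cutoff probability in the first column is the threshold that maximizes MCC over a grid of candidate cutoffs on each dataset. A minimal, self-contained Python sketch of that scan (illustrative only; the function names are hypothetical):

```python
import math

def mcc_counts(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; 0.0 when any marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_cutoff(probs, labels, cutoffs):
    """Return the (cutoff, mcc) pair maximizing MCC over candidate cutoffs.
    probs are modeled probabilities; labels are 0/1 outcomes."""
    best = (None, -1.0)
    for c in cutoffs:
        preds = [p >= c for p in probs]
        tp = sum(pr and y for pr, y in zip(preds, labels))
        tn = sum(not pr and not y for pr, y in zip(preds, labels))
        fp = sum(pr and not y for pr, y in zip(preds, labels))
        fn = sum(not pr and y for pr, y in zip(preds, labels))
        m = mcc_counts(tp, tn, fp, fn)
        if m > best[1]:
            best = (c, m)
    return best
```

Note how unstable this selection is in small subsets: here the training data picks a cutoff of about 0.20 while the holdout data picks 0.002, and the two rows disagree sharply on the achievable MCC.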
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'unemployed4wks', labelAppendix]))

Unemployed

cat('#### Model Parameters')

Model Parameters

model <- 'unemployed4wks'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Unemployed

|                   | Dependent variable: MoUnemp8 |
|-------------------|------------------------------|
| zAFQT89           | -0.49487*** (0.12989)        |
| zSES              | -0.02535 (0.13838)           |
| zAge              | -0.02181 (0.11084)           |
| Constant          | -2.53577*** (0.10760)        |
| Observations      | 1,397                        |
| Log Likelihood    | -348.71510                   |
| Akaike Inf. Crit. | 705.43020                    |

Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'unemployed4wks'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'unemployed4wks'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'unemployed4wks'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'unemployed4wks'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC ("Apparent", "Bootstrapped Mean", and "Optimism Corrected" columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.0981964          | 0.0579804   | 0.1268596 | 0.1159671         | 0.1173604          | 0.2025520   |
| Holdout  | 0.0821643          | 0.0489113   | 0.1684439 | 0.1672231         | 0.1687170          | 0.2753726   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'unemployed4wks.hs', labelAppendix]))

Unemployed (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'unemployed4wks.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Unemployed (Highschool Subset)

|                   | Dependent variable: MoUnemp8 |
|-------------------|------------------------------|
| zAFQT89           | -0.39353* (0.23687)          |
| zSES              | 0.13952 (0.23532)            |
| zAge              | -0.10511 (0.17625)           |
| Constant          | -2.59878*** (0.17661)        |
| Observations      | 537                          |
| Log Likelihood    | -140.49120                   |
| Akaike Inf. Crit. | 288.98240                    |

Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'unemployed4wks.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'unemployed4wks.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'unemployed4wks.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'unemployed4wks.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC ("Apparent", "Bootstrapped Mean", and "Optimism Corrected" columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.1803607          | -0.0194870  | 0.152253  | 0.1113602         | 0.1112092          | 0.3195728   |
| Holdout  | 0.1402806          | -0.0898027  | -0.029951 | 0.0978638         | -0.0331602         | 0.4767900   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'unemployed4wks.col', labelAppendix]))

Unemployed (College Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'unemployed4wks.col'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Unemployed (College Subset)

Dependent variable: MoUnemp8

zAFQT89            -0.91969    (0.56414)
zSES                1.00393**  (0.50155)
zAge                0.29420    (0.33111)
Constant           -3.16869*** (0.72765)

Observations        228
Log Likelihood      -40.50613
Akaike Inf. Crit.   89.01227
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'unemployed4wks.col'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'unemployed4wks.col'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'unemployed4wks.col'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'unemployed4wks.col'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
                                                ------------- Optimal MCC --------------
Dataset    Cutoff Probability   Lower Bound    Apparent     Bootstrapped Mean   Optimism Corrected   Upper Bound
Training   0.2625251            -0.0659163     -0.0149435   0.2189325           -0.1023747           0.2766287
Holdout    0.2084168            -0.0772503     0.4803922    0.2460685           0.4547918            0.8573214
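For readers interpreting these summary tables: the "Optimism Corrected" column follows Efron's bootstrap optimism correction, i.e., the apparent MCC minus the average gap between each bootstrap model's MCC on its own resample and its MCC on the original data. A minimal sketch of the two calculations (in Python purely for illustration; the report's actual code is R, and these function names are hypothetical, not the report's):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a 2x2 confusion matrix.
    Returns 0.0 when any marginal is zero (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def optimism_corrected(apparent, boot_self_mccs, boot_orig_mccs):
    """Efron-style correction: apparent MCC minus the mean optimism,
    where optimism = (bootstrap model's MCC on its own resample)
                   - (same model's MCC scored on the original data)."""
    optimism = sum(s - o for s, o in zip(boot_self_mccs, boot_orig_mccs)) / len(boot_self_mccs)
    return apparent - optimism
```

An apparent MCC near zero with a large positive bootstrapped mean (as in the training row above) is exactly the overfitting signature this correction is designed to expose.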
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'ever_married30', labelAppendix]))

Ever Married

cat('#### Model Parameters')

Model Parameters

model <- 'ever_married30'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Ever Married

Dependent variable: WedBy30

zAFQT89            -0.04736    (0.07579)
zSES               -0.19055**  (0.07863)
zAge                0.20403    (0.12905)
Constant            1.19841*** (0.12890)

Observations        1,634
Log Likelihood      -839.76750
Akaike Inf. Crit.   1,687.53500
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'ever_married30'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'ever_married30'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'ever_married30'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'ever_married30'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
                                                ------------- Optimal MCC --------------
Dataset    Cutoff Probability   Lower Bound    Apparent     Bootstrapped Mean   Optimism Corrected   Upper Bound
Training   0.8116232            0.0147045      0.0770474    0.0735238           0.0653208            0.1256629
Holdout    0.7695391            -0.0430202     0.0630608    0.0466488           0.0623341            0.1385667
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'ever_married30.hs', labelAppendix]))

Ever Married (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'ever_married30.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Ever Married (Highschool Subset)

Dependent variable: WedBy30

zAFQT89             0.51425*** (0.15984)
zSES               -0.11288    (0.15828)
zAge                0.36827    (0.24225)
Constant            1.41495*** (0.23427)

Observations        605
Log Likelihood      -259.40300
Akaike Inf. Crit.   526.80590
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'ever_married30.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'ever_married30.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'ever_married30.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'ever_married30.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
                                                ------------- Optimal MCC --------------
Dataset    Cutoff Probability   Lower Bound    Apparent     Bootstrapped Mean   Optimism Corrected   Upper Bound
Training   0.7054108            0.0942768      0.2312680    0.1916036           0.2092067            0.3823478
Holdout    0.9438878            -0.1825518     0.0336324    0.0312788           0.0341430            0.1174501
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'ever_married30.col', labelAppendix]))

Ever Married (College Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'ever_married30.col'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Ever Married (College Subset)

Dependent variable: WedBy30

zAFQT89             0.05014    (0.22375)
zSES                0.09683    (0.18337)
zAge               -0.01778    (0.29509)
Constant            0.71372*   (0.39462)

Observations        237
Log Likelihood      -145.35750
Akaike Inf. Crit.   298.71500
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'ever_married30.col'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'ever_married30.col'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'ever_married30.col'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'ever_married30.col'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
                                                ------------- Optimal MCC --------------
Dataset    Cutoff Probability   Lower Bound    Apparent     Bootstrapped Mean   Optimism Corrected   Upper Bound
Training   0.6693387            -0.2084305     -0.0098520   0.1014006           -0.0863671           0.0447238
Holdout    0.6432866            -0.1408794     -0.0482243   0.0646447           -0.0496034           0.3081470
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'divorced_in5yrs', labelAppendix]))

Divorced in 1st 5 Years of Marriage

cat('#### Model Parameters')

Model Parameters

model <- 'divorced_in5yrs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Divorced in 1st 5 Years of Marriage

Dependent variable: Div5Yrs

zAFQT89            -0.35734*** (0.07813)
zSES                0.22195*** (0.07876)
zAge               -0.17767**  (0.07415)
MarDate            -0.08677*** (0.02431)
Constant            5.70861*** (1.98581)

Observations        2,031
Log Likelihood      -991.37190
Akaike Inf. Crit.   1,992.74400
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'divorced_in5yrs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'divorced_in5yrs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'divorced_in5yrs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'divorced_in5yrs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
                                                ------------- Optimal MCC --------------
Dataset    Cutoff Probability   Lower Bound    Apparent     Bootstrapped Mean   Optimism Corrected   Upper Bound
Training   0.1903808            0.0691469      0.1197176    0.1241194           0.1122616            0.1623367
Holdout    0.1543086            0.0550943      0.1618226    0.1380496           0.1615906            0.2112502
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'divorced_in5yrs.hs', labelAppendix]))

Divorced in 1st 5 Years of Marriage (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'divorced_in5yrs.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Divorced in 1st 5 Years of Marriage (Highschool Subset)

| Term     | Coefficient | Std. Error |
|:---------|------------:|-----------:|
| zAFQT89  | -0.03792    | (0.13481)  |
| zSES     | 0.22069*    | (0.12882)  |
| zAge     | -0.10781    | (0.11468)  |
| MarDate  | -0.08400**  | (0.03832)  |
| Constant | 5.44514*    | (3.12869)  |

Dependent variable: Div5Yrs. Observations: 875. Log Likelihood: -428.70640. Akaike Inf. Crit.: 867.41290.
Note: *p<0.1; **p<0.05; ***p<0.01
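For readers who want to reproduce a model like the one above, the fit is an ordinary logistic regression. A minimal sketch, assuming `DT.model` is a stand-in for the report's training extract with the column names shown in the table:

```r
# Hedged sketch: refit the displayed TBC-style model with base R.
# DT.model is an assumed data.frame holding the analysis extract used above.
fit.div5.hs <- glm(
  Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate
  , family = binomial(link = 'logit')
  , data = DT.model
)
summary(fit.div5.hs)                              # coefficients and standard errors
p.hat <- predict(fit.div5.hs, type = 'response')  # modeled probabilities per respondent
```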
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'divorced_in5yrs.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'divorced_in5yrs.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'divorced_in5yrs.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'divorced_in5yrs.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC (spanning the Apparent, Bootstrapped Mean, and Optimism Corrected columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:--------:|:------------------:|:-----------:|:---------:|:-----------------:|:------------------:|:-----------:|
| Training | 0.2004008          | -0.1043035  | 0.0448236 | 0.0829293         | 0.0161811          | 0.0802507   |
| Holdout  | 0.1523046          | -0.0941931  | 0.1155382 | 0.0713586         | 0.1158789          | 0.2201537   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'divorced_in5yrs.col', labelAppendix]))

Divorced in 1st 5 Years of Marriage (College Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'divorced_in5yrs.col'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Divorced in 1st 5 Years of Marriage (College Subset)

| Term     | Coefficient | Std. Error  |
|:---------|------------:|------------:|
| zAFQT89  | -0.75623*   | (0.45023)   |
| zSES     | -0.07353    | (0.35889)   |
| zAge     | -0.55879    | (0.40470)   |
| MarDate  | -0.41115**  | (0.16298)   |
| Constant | 32.39423**  | (13.50911)  |

Dependent variable: Div5Yrs. Observations: 209. Log Likelihood: -48.41447. Akaike Inf. Crit.: 106.82890.
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'divorced_in5yrs.col'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'divorced_in5yrs.col'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'divorced_in5yrs.col'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'divorced_in5yrs.col'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC (spanning the Apparent, Bootstrapped Mean, and Optimism Corrected columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:--------:|:------------------:|:-----------:|:---------:|:-----------------:|:------------------:|:-----------:|
| Training | 0.0901804          | -0.0555556  | 0.2374482 | 0.2598149         | 0.1755364          | 0.3716570   |
| Holdout  | 0.0721443          | -0.1124844  | 0.2285707 | 0.1731752         | 0.2320964          | 0.4331122   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'divorced_in5yrs_parents', labelAppendix]))

Divorced in 1st 5 Years of Marriage (With Parents Factors)

cat('#### Model Parameters')

Model Parameters

model <- 'divorced_in5yrs_parents'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Divorced in 1st 5 Years of Marriage (With Parents Factors)

| Term                | Coefficient | Std. Error |
|:--------------------|------------:|-----------:|
| zAFQT89             | -0.39256*** | (0.07742)  |
| zSES                | 0.19104**   | (0.07833)  |
| zAge                | -0.02781    | (0.06177)  |
| Adult14S: 2 Bio     | -0.31954    | (0.20437)  |
| Adult14S: Bio/Step  | -0.11008    | (0.25682)  |
| Adult14S: Other     | -0.25948    | (0.37893)  |
| Constant            | -1.12299*** | (0.19200)  |

Dependent variable: Div5Yrs. Observations: 2,029. Log Likelihood: -994.67710. Akaike Inf. Crit.: 2,003.35400.
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'divorced_in5yrs_parents'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'divorced_in5yrs_parents'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'divorced_in5yrs_parents'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'divorced_in5yrs_parents'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC (spanning the Apparent, Bootstrapped Mean, and Optimism Corrected columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:--------:|:------------------:|:-----------:|:---------:|:-----------------:|:------------------:|:-----------:|
| Training | 0.1943888          | 0.0151819   | 0.0909336 | 0.1080654         | 0.0773499          | 0.1202274   |
| Holdout  | 0.1623246          | 0.0213149   | 0.1331039 | 0.1118409         | 0.1332435          | 0.1907022   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'surveyed_in_jail', labelAppendix]))

Surveyed in Jail

cat('#### Model Parameters')

Model Parameters

model <- 'surveyed_in_jail'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Surveyed in Jail

| Term     | Coefficient | Std. Error |
|:---------|------------:|-----------:|
| zAFQT89  | -0.89670*** | (0.17536)  |
| zSES     | -0.15554    | (0.18062)  |
| zAge     | 0.07830     | (0.14687)  |
| Constant | -3.77722*** | (0.17180)  |

Dependent variable: Jail. Observations: 1,945. Log Likelihood: -219.90130. Akaike Inf. Crit.: 447.80250.
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'surveyed_in_jail'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'surveyed_in_jail'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'surveyed_in_jail'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge

cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'surveyed_in_jail'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC (spanning the Apparent, Bootstrapped Mean, and Optimism Corrected columns)

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|:--------:|:------------------:|:-----------:|:---------:|:-----------------:|:------------------:|:-----------:|
| Training | 0.0260521          | 0.1043961   | 0.1638284 | 0.1611350         | 0.1565315          | 0.2063444   |
| Holdout  | 0.0380762          | 0.1237940   | 0.2362658 | 0.2136281         | 0.2365009          | 0.2980373   |
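The Lower Bound and Upper Bound columns mix two interval types: BCa intervals for the training rows and simple percentile intervals for the holdout rows. With the boot package, both come from the same bootstrap object. A sketch, assuming `mcc()` is a helper returning Matthews correlation from logical vectors and `DT.eval` an assumed data.frame of observed outcomes and modeled probabilities:

```r
library(boot)

# Sketch: bootstrap CIs for MCC at a fixed cutoff (illustrative only).
# mcc() is an assumed helper; DT.eval has assumed columns 'actual' and 'prob'.
mcc.stat <- function(data, idx) {
  d <- data[idx, ]
  mcc(d$actual == 1, d$prob > 0.026)  # cutoff roughly matching the table above
}

boot.out <- boot(DT.eval, statistic = mcc.stat, R = 2000)
boot.ci(boot.out, conf = 0.95, type = 'bca')   # BCa bounds (training rows)
boot.ci(boot.out, conf = 0.95, type = 'perc')  # percentile bounds (holdout rows)
```

BCa intervals adjust for bias and skew in the bootstrap distribution, which matters for a bounded, often-skewed statistic like MCC; percentile intervals simply take the empirical quantiles.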
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)

cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)

cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)

cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)

cat(paste0('### ', tbc.models[label == 'surveyed_in_jail.hs', labelAppendix]))

Surveyed in Jail (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'surveyed_in_jail.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: Surveyed in Jail (Highschool Subset)

| Term     | Coefficient | Std. Error |
|:---------|------------:|-----------:|
| zAFQT89  | -1.07007**  | (0.44312)  |
| zSES     | -0.16212    | (0.46430)  |
| zAge     | 0.46727     | (0.36775)  |
| Constant | -4.96579*** | (0.48063)  |

Dependent variable: Jail. Observations: 716. Log Likelihood: -39.85058. Akaike Inf. Crit.: 87.70117.
Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'surveyed_in_jail.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89

cat('##### zSES')
zSES
model <- 'surveyed_in_jail.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES

cat('##### zAge')
zAge
model <- 'surveyed_in_jail.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge


cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'surveyed_in_jail.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.2244489          | -0.0104750  | 0.3518185 | 0.3445395         | 0.2199638          | 0.5765422   |
| Holdout  | 0.0040080          | -0.1516913  | 0.0854826 | 0.0422934         | 0.0852632          | 0.1890059   |
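For readers unfamiliar with the metric, the MCC values in tables like the one above come from a 2×2 confusion matrix formed at a chosen cutoff probability. A minimal sketch (illustrative only, not the project's actual helper functions; the argument names are assumptions):

```r
# Matthews correlation coefficient for a binary outcome at a given cutoff.
# (Illustrative sketch; argument names are assumptions.)
mcc <- function(observed, probabilities, cutoff) {
  predicted <- as.integer(probabilities >= cutoff)
  tp <- sum(predicted == 1 & observed == 1)
  tn <- sum(predicted == 0 & observed == 0)
  fp <- sum(predicted == 1 & observed == 0)
  fn <- sum(predicted == 0 & observed == 1)
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (denom == 0) return(0)  # convention when a margin of the confusion matrix is empty
  (tp * tn - fp * fn) / denom
}

# A perfect classifier at cutoff 0.5 yields MCC = 1:
# mcc(c(1, 1, 0, 0), c(0.9, 0.8, 0.2, 0.1), 0.5)
```

Roughly speaking, the "Apparent" column is this quantity evaluated once on the dataset itself, while the bootstrapped columns repeat the calculation across resamples.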
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)


cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)


cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)


cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)


cat(paste0('### ', tbc.models[label == 'middle_class_values', labelAppendix]))

“Middle Class Values”

cat('#### Model Parameters')

Model Parameters

model <- 'middle_class_values'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: “Middle Class Values”

|                   | Dependent variable: MCV |
|-------------------|-------------------------|
| zAFQT89           | 0.63251*** (0.05282)    |
| zSES              | 0.24496*** (0.05206)    |
| zAge              | 0.00664 (0.04019)       |
| Constant          | -0.06385 (0.03893)      |
| Observations      | 3,029                   |
| Log Likelihood    | -1,937.43300            |
| Akaike Inf. Crit. | 3,882.86600             |

Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'middle_class_values'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89


cat('##### zSES')
zSES
model <- 'middle_class_values'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES


cat('##### zAge')
zAge
model <- 'middle_class_values'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge


cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'middle_class_values'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.4348697          | 0.2339730   | 0.2710309 | 0.2731777         | 0.2692391          | 0.3033458   |
| Holdout  | 0.4709419          | 0.3932675   | 0.4511110 | 0.4506696         | 0.4512280          | 0.5077860   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)


cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)


cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)


cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)


cat(paste0('### ', tbc.models[label == 'middle_class_values.hs', labelAppendix]))

“Middle Class Values” (Highschool Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'middle_class_values.hs'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: “Middle Class Values” (Highschool Subset)

|                   | Dependent variable: MCV |
|-------------------|-------------------------|
| zAFQT89           | 0.16815* (0.09312)      |
| zSES              | -0.17993** (0.09034)    |
| zAge              | 0.01888 (0.06218)       |
| Constant          | 0.39448*** (0.06118)    |
| Observations      | 1,162                   |
| Log Likelihood    | -781.85690              |
| Akaike Inf. Crit. | 1,571.71400             |

Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'middle_class_values.hs'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89


cat('##### zSES')
zSES
model <- 'middle_class_values.hs'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES


cat('##### zAge')
zAge
model <- 'middle_class_values.hs'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge


cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'middle_class_values.hs'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent  | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|-----------|-------------------|--------------------|-------------|
| Training | 0.5771543          | 0.0140331   | 0.0845855 | 0.0777591         | 0.0657706          | 0.1559252   |
| Holdout  | 0.5430862          | -0.1051121  | 0.1361782 | 0.0346021         | 0.1350699          | 0.1646227   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)


cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)


cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)


cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)


cat(paste0('### ', tbc.models[label == 'middle_class_values.col', labelAppendix]))

“Middle Class Values” (College Subset)

cat('#### Model Parameters')

Model Parameters

model <- 'middle_class_values.col'
title.model <- paste0(
  'Reproduced TBC Probability Model: '
  , tbc.models[label == model, labelAppendix]
)
target.model <- tbc.models[label == model, response]

stargazer(
  outputs[label == model, glm.model][[1]]
  , type = 'html'
  , title = title.model
  , dep.var.labels = target.model
  , digits = 5
)
Reproduced TBC Probability Model: “Middle Class Values” (College Subset)

|                   | Dependent variable: MCV |
|-------------------|-------------------------|
| zAFQT89           | 0.39251** (0.19881)     |
| zSES              | 0.03692 (0.16858)       |
| zAge              | 0.13876 (0.13364)       |
| Constant          | 0.99516*** (0.23868)    |
| Observations      | 402                     |
| Log Likelihood    | -200.09150              |
| Akaike Inf. Crit. | 408.18290               |

Note: *p<0.1; **p<0.05; ***p<0.01
cat('#### Plotting Modeled Probabilities{.tabset .tabset-fade}')

Plotting Modeled Probabilities

cat('##### zAFQT89')
zAFQT89
model <- 'middle_class_values.col'
factorSelection <- 'zAFQT89'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAFQT89


cat('##### zSES')
zSES
model <- 'middle_class_values.col'
factorSelection <- 'zSES'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zSES


cat('##### zAge')
zAge
model <- 'middle_class_values.col'
factorSelection <- 'zAge'

plotFrequencyVsProbability.bootCI(
     DT.factors = inputs[[model]][['DT.model.training']]
    , DT.bootCI = DTs.probability.bootCI.training[[model]][[factorSelection]]
    , xString = factorSelection
    , yString = inputs[[model]][['model parameters']][, response]
    , key = 'ID'
    , bandingPrecision = 10
    , alpha = 0.05
    , response.label = inputs[[model]][['model parameters']][, target.descriptions.formatted]
)
Plot of observed vs bootstrapped modeled probabilities across zAge


cat('#### Plotting MCC Distributions{.tabset .tabset-fade}')

Plotting MCC Distributions

model <- 'middle_class_values.col'

kable(
  data.table(
    Dataset = c('Training', 'Holdout')
    , `Cutoff Probability` = c(
      outputs[label == (model), cutoff.mcc.training][[1]]
      , outputs[label == (model), cutoff.mcc.holdout][[1]]
    )
    , `Lower Bound` = c(
      outputs[label == (model), mcc.bootstrap.lower.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.lower.bootPercentile.holdout][[1]]
    )
    , `Apparent` = c(
      outputs[label == (model), mcc.apparent.training][[1]]
      , outputs[label == (model), mcc.apparent.holdout][[1]]
    )
    , `Bootstrapped Mean` = c(
      outputs[label == (model), mcc.bootstrap.training][[1]]
      , outputs[label == (model), mcc.bootstrap.holdout][[1]]
    )
    , `Optimism Corrected` = c(
      outputs[label == (model), mcc.optimism.corrected.training][[1]]
      , outputs[label == (model), mcc.optimism.corrected.holdout][[1]]
    )
    , `Upper Bound` = c(
      outputs[label == (model), mcc.bootstrap.upper.BCa.training][[1]]
      , outputs[label == (model), mcc.bootstrap.upper.bootPercentile.holdout][[1]]
    )
  )
  , align = rep('c', 7)
) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(' ' = 3, 'Optimal MCC' = 3, ' ' = 1))
Optimal MCC

| Dataset  | Cutoff Probability | Lower Bound | Apparent   | Bootstrapped Mean | Optimism Corrected | Upper Bound |
|----------|--------------------|-------------|------------|-------------------|--------------------|-------------|
| Training | 0.7374749          | -0.0541101  | 0.1543930  | 0.1641656         | 0.1206757          | 0.2655808   |
| Holdout  | 0.6773547          | -0.1091326  | -0.0511336 | 0.0724688         | -0.0597152         | 0.3261558   |
cat('##### Training')
Training
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.training[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Training Data)


cat('##### Holdout')
Holdout
plotPerformanceMetric.cutoff.bootCI(
  DT.bootCI = DT.mcc.bootCI.holdout[label == model]
  , performanceMetric = 'mcc'
  , performanceMetric.boot = 'mcc.bootstrap'
  , cutoff.name = 'cutoff'
  , lowerCI.name = 'lower'
  , upperCI.name = 'upper'
)
Plot of bootstrapped MCC distributions by cutoff (Holdout Data)


cat('#### Plotting Modeled Classifications{.tabset .tabset-fade}')

Plotting Modeled Classifications

cat('##### Training')
Training
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.training']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.training[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.training[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Training Data)


cat('##### Holdout')
Holdout
plotFrequencyVsClassification.bootCI(
  DT.factors = inputs[[model]][['DT.model.holdout']]
  , DT.classification.bootCI = DTs.mcc.classification.bootCI.holdout[[model]][['zAFQT89']]
  , DT.probabilities.bootCI = DTs.probability.bootCI.holdout[[model]][['zAFQT89']]
  , xString = 'zAFQT89'
  , yString = outputs[label == model, response]
  , cutoff = outputs[label == model, cutoff.mcc.training]
  , key = 'ID'
  , bandingPrecision = 10
  , alpha = 0.05
  , response.label = outputs[label == model, target.descriptions.formatted]
)
Plot comparing bootstrapped classifications and modeled probabilities (Holdout Data)


Validating Results with boot Package

DT.compare.bootCI <- outputs[
  , .(
    label = factor(label
                   , levels = rev(levels(label))
            )
      , mcc.bootstrap.lower.BCa.training, mcc.apparent.training, mcc.bootstrap.upper.BCa.training
      , lower.mcc.bootPackage, optimal.mcc.bootPackage, upper.mcc.bootPackage
      , lb.diff = mcc.bootstrap.lower.BCa.training - lower.mcc.bootPackage
      , ub.diff = mcc.bootstrap.upper.BCa.training - upper.mcc.bootPackage
  )
][, maxAbs.diff := pmax(abs(lb.diff), abs(ub.diff))]

DT.compare.bootCI.melt <- melt(
  DT.compare.bootCI
  , id.vars = c('label', 'lb.diff', 'ub.diff', 'maxAbs.diff')
  , measure.vars = patterns('lower','(apparent|optimal)','upper')
  , variable.name = 'implementation'
  , value.name = c('lower', 'estimate', 'upper')
)

DT.compare.bootCI.melt[, label := factor(label, levels = DT.compare.bootCI[order(-maxAbs.diff), label], ordered = TRUE)]

implementation.names <- c('Custom BCa for data.table', 'boot package')
DT.compare.bootCI.melt[
  , implementation := factor(
    implementation.names[c(rep(1,nrow(DT.compare.bootCI)), rep(2,nrow(DT.compare.bootCI)))]
    , levels = implementation.names
  )
]

One way to test whether the results of this analysis are reasonable is to reproduce part of it using the boot package, which facilitates bootstrapping and can calculate BCa intervals. In the script used to store the project’s results (i.e., TBC_Bootstrap), the boot::boot function performs the following steps:

  1. Fits a probability model to a bootstrapped version of each model’s training data

  2. Creates a classification using the latter model’s fitted values and the optimal cutoff value calculated from the custom bootstrap functions used in this project

  3. Calculates the MCC value for that bootstrapped model
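The three steps above can be sketched as a statistic function passed to boot::boot. This is an illustrative sketch, not the project's actual TBC_Bootstrap code; the function name and the data, formula, and cutoff shown in the usage comment are assumptions:

```r
# Illustrative statistic function for boot::boot (assumed names, not the
# project's actual TBC_Bootstrap code).
mcc.statistic <- function(data, indices, formula, cutoff) {
  resample <- data[indices, ]                        # 1. bootstrap the training data
  fit <- glm(formula, family = binomial, data = resample)
  predicted <- as.integer(fitted(fit) >= cutoff)     # 2. classify at the fixed optimal cutoff
  observed <- model.response(model.frame(fit))
  tp <- sum(predicted == 1 & observed == 1)          # 3. compute the MCC for this resample
  tn <- sum(predicted == 0 & observed == 0)
  fp <- sum(predicted == 1 & observed == 0)
  fn <- sum(predicted == 0 & observed == 1)
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (denom == 0) 0 else (tp * tn - fp * fn) / denom
}

# Hypothetical usage: extra named arguments to boot() are passed to the statistic.
# boot.out <- boot::boot(DT.model.training, mcc.statistic, R = 2000,
#                        formula = Jail ~ zAFQT89 + zSES + zAge, cutoff = 0.22)
# boot::boot.ci(boot.out, type = "bca")
```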

    The plot below compares the MCC estimates and BCa intervals as calculated by this analysis and by the boot package. The MCC estimates are identical, and the BCa intervals agree closely. The model labels are ordered by the largest absolute difference in BCa bounds. For example, if the lower BCa bounds for the Get GED model differ by 3.8666946 × 10^-5 and the upper BCa bounds differ by 4.7202846 × 10^-4, then the largest difference in BCa bounds for that model is 4.7202846 × 10^-4. The model with the most extreme difference in BCa bounds is Unemployed (Highschool Subset), at 0.0393696.
ggplot(
  DT.compare.bootCI.melt
  , aes(
    x = label
    , ymin = lower
    , y = estimate
    , ymax = upper
    , color = implementation
    , group = implementation
  )
) + 
  geom_point() +
  geom_errorbar(alpha = 0.5) + 
  theme_bw() +
  coord_flip()
Plot comparing bootstrapped MCC estimates and confidence intervals measured by this analysis and the boot package


kable(
  DT.compare.bootCI[tbc.models[, .(label, labelAppendix)], on = 'label'][order(maxAbs.diff)][
    , .(
      `Model` = labelAppendix
      , `Largest Difference in BCa bounds` = maxAbs.diff
    )
  ]
  , align = c('l', 'c')
  , caption = 'Table comparing bootstrapped MCC estimates and confidence intervals measured by this analysis and the boot package'
) %>% kable_styling(bootstrap_options = "striped", full_width = F)
Table comparing bootstrapped MCC estimates and confidence intervals measured by this analysis and the boot package

| Model                                                      | Largest Difference in BCa bounds |
|------------------------------------------------------------|----------------------------------|
| Get Bachelors                                              | 0.0004678                        |
| Get GED                                                    | 0.0004720                        |
| Out of the Labor Force                                     | 0.0006093                        |
| “Middle Class Values” (College Subset)                     | 0.0007759                        |
| Dropout                                                    | 0.0011253                        |
| Ever Married                                               | 0.0011459                        |
| Poverty (Highschool Subset)                                | 0.0013201                        |
| “Middle Class Values”                                      | 0.0019633                        |
| Unemployed                                                 | 0.0022801                        |
| Divorced in 1st 5 Years of Marriage                        | 0.0022979                        |
| “Middle Class Values” (Highschool Subset)                  | 0.0023998                        |
| Out of the Labor Force (Highschool Subset)                 | 0.0030471                        |
| Dropout (With Interaction)                                 | 0.0035205                        |
| Ever Married (Highschool Subset)                           | 0.0036973                        |
| Poverty                                                    | 0.0040212                        |
| Divorced in 1st 5 Years of Marriage (With Parents Factors) | 0.0050280                        |
| Surveyed in Jail                                           | 0.0066849                        |
| Divorced in 1st 5 Years of Marriage (College Subset)       | 0.0177187                        |
| Unemployed (College Subset)                                | 0.0181666                        |
| Out of the Labor Force (College Subset)                    | 0.0227026                        |
| Divorced in 1st 5 Years of Marriage (Highschool Subset)    | 0.0301332                        |
| Surveyed in Jail (Highschool Subset)                       | 0.0336793                        |
| Ever Married (College Subset)                              | 0.0377535                        |
| Unemployed (Highschool Subset)                             | 0.0393696                        |

Session Info

R

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2   kableExtra_0.9.0     stargazer_5.2.2     
##  [4] knitr_1.20           plotly_4.8.0         boot_1.3-20         
##  [7] snow_0.4-3           cowplot_0.9.3        ggplot2_3.1.0       
## [10] data.table_1.11.8    RevoUtils_11.0.1     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5  purrr_0.2.5       colorspace_1.3-2 
##  [4] htmltools_0.3.6   viridisLite_0.3.0 yaml_2.2.0       
##  [7] rlang_0.3.0.1     later_0.7.5       pillar_1.3.0     
## [10] glue_1.3.0        withr_2.1.2       bindrcpp_0.2.2   
## [13] bindr_0.1.1       plyr_1.8.4        stringr_1.3.1    
## [16] munsell_0.5.0     gtable_0.2.0      rvest_0.3.2      
## [19] htmlwidgets_1.3   evaluate_0.12     labeling_0.3     
## [22] httpuv_1.4.5      crosstalk_1.0.0   parallel_3.5.1   
## [25] highr_0.7         Rcpp_1.0.0        xtable_1.8-3     
## [28] readr_1.1.1       promises_1.0.1    scales_1.0.0     
## [31] backports_1.1.2   jsonlite_1.5      mime_0.6         
## [34] hms_0.4.2         digest_0.6.18     stringi_1.2.4    
## [37] dplyr_0.7.8       shiny_1.2.0       grid_3.5.1       
## [40] rprojroot_1.3-2   tools_3.5.1       magrittr_1.5     
## [43] lazyeval_0.2.1    tibble_1.4.2      crayon_1.3.4     
## [46] tidyr_0.8.2       pkgconfig_2.0.2   xml2_1.2.0       
## [49] assertthat_0.2.0  rmarkdown_1.10    httr_1.3.1       
## [52] rstudioapi_0.7    R6_2.2.2          compiler_3.5.1

RStudio

Version 1.1.463

References

Software

  1. R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

  2. Matt Dowle and Arun Srinivasan (2018). data.table: Extension of data.frame. R package version 1.11.8. https://CRAN.R-project.org/package=data.table

  3. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

  4. Luke Tierney, A. J. Rossini, Na Li and H. Sevcikova (2018). snow: Simple Network of Workstations. R package version 0.4-3. https://CRAN.R-project.org/package=snow

  5. Claus O. Wilke (2018). cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 0.9.3. https://CRAN.R-project.org/package=cowplot

  6. Yihui Xie (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.20.

  7. Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

  8. Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

  9. Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer

  10. Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

  11. Hao Zhu (2018). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 0.9.0. https://CRAN.R-project.org/package=kableExtra

  12. RStudio Team (2016). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.

Publications


  1. Herrnstein, R. J., & Murray, C. (1994). The Bell Curve. New York: The Free Press (1st paperback ed., 1996). ISBN-13: 978-0-684-82429

  2. Bureau of Labor Statistics, U.S. Department of Labor. National Longitudinal Survey of Youth 1979 cohort, 1979-2012 (rounds 1-25). Produced and distributed by the Center for Human Resource Research, The Ohio State University. Columbus, OH: 2014.

  3. Center for Human Resource Research, The Ohio State University (2001). NLSY79 Users' Guide: A Guide to the 1979–2000 National Longitudinal Survey of Youth Data. Prepared for the U.S. Department of Labor. Columbus, OH: Center for Human Resource Research, The Ohio State University.

  4. Heckman, J. (1995). Lessons from The Bell Curve. Journal of Political Economy, 103(5), 1091–1120.

  5. Krenz, C. (2007, August 8). Anatomy of an Analysis. Retrieved from http://www.claudiax.net/bell.html.

  6. Shmueli, G. (2010). To Explain or To Predict? Statistical Science, 25(3), 289–310.

  7. Miele, F. (1995). For Whom The Bell Curve Tolls. Skeptic, 3(3), 34–41.

  8. Mukaka, M. M. (2012). Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24(3), 69–71.

  9. Boughorbel, S., Jarray, F., & El-Anbari, M. (2017). Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE, 12(6).

  10. Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78, 316–331.

  11. Harrell, F. E., Lee, K. L., & Mark, D. B. (1996). Tutorial in biostatistics: Multivariable prognostic models. Statistics in Medicine, 15, 361–387.
